Abstract

This chapter presents the science of “Collective INtelligence” (COIN). A COIN
is a large multi-agent system in which:

i) the agents each run reinforcement learning (RL) algorithms; 

ii) there is little to no centralized communication or control; 

iii) there is a provided world utility function that rates the possible histories of the 
full system. 

The conventional approach to designing large distributed systems to optimize a world 
utility does not use agents running RL algorithms. Rather that approach begins with 
explicit modeling of the overall system’s dynamics, followed by detailed hand-tuning 
of the interactions between the components to ensure that they “cooperate” as far 
as the world utility is concerned. This approach is labor-intensive, often results in
highly non-robust systems, and usually results in design techniques that have limited
applicability.

In contrast, with COINs we wish to solve the system design problems implicitly, 
via the ‘adaptive’ character of the RL algorithms of each of the agents. This COIN 
approach introduces an entirely new, profound design problem: Assuming the RL 
algorithms are able to achieve high rewards, what reward functions for the individual 
agents will, when pursued by those agents, result in high world utility? In other 
words, what reward functions will best ensure that we do not have phenomena like 
the tragedy of the commons, or Braess’s paradox? 

Although still very young, the science of COINs has already resulted in successes in 
artificial domains, in particular in packet-routing, the leader-follower problem, and 
in variants of Arthur’s “El Farol bar problem”. It is expected that as it matures 
not only will COIN science expand greatly the range of tasks addressable by human 
engineers, but it will also provide much insight into already established scientific 
fields, such as economics, game theory, or population biology. 

1 Introduction

Over the past decade or so two developments have occurred in computer science whose 
intersection promises to open a vast new area of research, an area extending far beyond 
the boundaries of conventional computer science. The first of these developments is the 
growing realization of how useful it would be to be able to control distributed systems 
which have little (if any) centralized communication, and to do so ‘adaptively’, with
minimal reliance on detailed knowledge of the system’s small-scale dynamical behavior. 
This realization has been most recently manifested in the field of amorphous computing 
[1]. The second development is the maturing of the discipline of reinforcement learning
(RL). This is the branch of machine learning that is concerned with an agent who
periodically receives ‘reward’ signals from the environment that partially reflect the
value of that agent’s personal utility function. The goal of RL is to determine how,
using those reward signals, the agent should update its action policy so as to maximize
its utility [127, 230, 243].

Intuitively, one might hope that the tool of RL would help us solve the distributed 
control problem, since RL is adaptive, and in particular since it is not restricted to 
domains having sufficient breadths of communication. However by itself, conventional 
single-agent RL does not provide a means for controlling large, distributed systems. This 
is true even if the system does have centralized communication. The problem is that 
the space of possible action policies for such systems is too big to be searched. We
might imagine as a variant using a large set of agents, each controlling only part of the
system. Since the individual action spaces of such agents would be relatively small, we
could realistically deploy conventional RL on each one. However now we face the central
question of how to map the world utility function concerning the overall system into
personal utility functions for each of the agents. In particular, how should we design
those personal utility functions so that each agent can realistically hope to optimize its
function, and at the same time the collective behavior of the agents will optimize the
world utility?

We use the term “Collective INtelligence” (COIN) to refer to any pair of a large, 
distributed collection of interacting RL algorithms among which there is little to no 
centralized communication or control, together with a world utility function that rates 
the possible dynamic histories of the collection. The central COIN design problem is how, 
without any detailed modeling of the overall system, you can set the utility functions 
for the RL algorithms in a COIN to have the overall dynamics reliably and robustly 
achieve large values of the provided world utility. The benefits of an answer to this 
question would extend beyond the many branches of computer science, having major 
ramifications for many other sciences as well. The next section discusses some of those 
benefits. The following section reviews previous work that has bearing on the
design problem. The final section constitutes the core of this chapter. It presents a
quick outline of a promising mathematical framework for addressing this problem, and
then experimental illustrations of the prescriptions of that framework. Throughout we
will use italics for emphasis, single quotes for informally defined terms, and double quotes 
to delineate colloquial terminology. 


2 Background 

There are many design problems that involve distributed computational systems where 
there are strong restrictions on centralized communication (‘we can’t all talk’); or there is 
communication with a central processor, but that processor is not sufficiently powerful to 
determine how to control the entire system (‘we aren’t smart enough’); or the processor 
is powerful enough in principle, but it is not clear what algorithm it could run by itself 
that would effectively control the entire system (‘we don’t know what to think’). Just a 
few of the potential examples include: 

i) Designing a control system for constellations of communication satellites or for
constellations of planetary exploration vehicles (world utility in the latter case being
some measure of quality of scientific data collected); 

ii) Designing a control system for routing over a communication network (world utility 
being some aggregate quality of service measure); 

iii) Construction of parallel algorithms for solving numerical optimization problems 
(the optimization problem itself constituting the world utility); 

iv) Vehicular traffic control, e.g., air traffic control, or high-occupancy-toll-lanes for 
automobiles. (In these problems the individual agents are humans and the associated 
utility functions must be of a constrained form, reflecting the relatively inflexible kinds 
of preferences humans possess.); 

v) Routing over a power grid; 

vi) Control of a large, distributed chemical plant; 

viii) Control of the elements of an amorphous computer; 

ix) Control of the elements of a ‘noisy’ phased array radar; 

x) Compute-serving over an information power grid. 

Such systems may be best controlled with an artificial COIN. The potential useful- 
ness of solving the COIN design problem extends far beyond such engineering concerns 
however. That’s because the COIN design problem is an inverse problem, whereas essen- 
tially all of the scientific fields that are concerned with naturally-occurring distributed 
systems analyze them purely as a “forward problem”. That is, those fields analyze what 
global behavior would arise from provided local dynamical laws, rather than grapple 
with the inverse problem of how to configure those laws to induce desired global be- 
havior. It seems highly likely that the insights garnered from understanding the inverse 
problem would provide a trenchant novel perspective on those fields. Just as tackling 
the inverse problem in the design of steam engines led to the first true understanding of
the macroscopic properties of physical bodies (aka thermodynamics), so may the cracking
of the COIN design problem augment our understanding of many
naturally-occurring COINs.

As an example, consider countries with capitalist human economies. Such systems 
can be viewed as naturally occurring COINs. One can declare ‘world utility’ to be a time 
average of the Gross Domestic Product (GDP) of the country in question. (World utility 
per se is not a construction internal to a human economy, but rather something defined 
from the outside.) The reward functions for the human agents are the achievements of 
their personal goals (usually involving personal wealth to some degree). As commonly 
understood, the economy of the United States in the 1990’s, or of Japan through much
of the 1970’s and 1980’s, serves as an existence proof that the COIN design problem has 
solutions. 

Now in general, to achieve high global utility in a COIN it is necessary to avoid having 
the agents work at cross-purposes, lest phenomena like the Tragedy of the Commons 
(TOC) occur, in which individual avarice works to lower global utility [102]. One way 
to avoid such phenomena is by modifying the agents’ utility functions. In the context 
of capitalist economies, outcomes can be modified via punitive legislation. A real world 
example of an attempt to make just such a modification was the creation of anti-trust 
regulations designed to prevent monopolistic practices. 

In designing a COIN we usually have more freedom than anti-trust regulators though, 
in that there is no base-line “organic” local utility function over which we must superimpose
legislation-like incentives. Rather, the entire “psychology” of the individual agents
is at our disposal, when designing a COIN. This obviates the need for honesty-elicitation
(‘incentive compatible’) mechanisms, like auctions, which form a central component of 
conventional economics. Accordingly, COINs can differ in certain crucial respects from 
human economies. The precise differences — the subject of current research — seem 
likely to present many insights into the functioning of economic structures like anti-trust 
regulators. 

Another example of the novel perspective of COINs, also concerning human economies, 
is the usefulness of (commodity, or especially fiat) money. The traditional economics view 
is that money is useful because it is portable; universally valued (and therefore minimizes 
the number of “trading posts” needed [223]); allows “middlemen” to facilitate resource 
allocation, and the like. The COIN perspective however leads us to address lower-level 
aspects of the usefulness of money. For example, formally, ‘money’ constitutes a par- 
ticular class of couplings between the states and utility functions of the various agents. 
Now for any underlying system, any particular choice of utility functions for the agents
(such as utility functions involving money) will induce high levels of some world utilities.
But it will simultaneously induce low levels of other world utilities. This raises a host
of questions, like how to formally specify the most general set of world utilities that
benefit significantly from local utility functions drawn from the class of such functions
involving money. If one is provided a world utility that is not a member of that
set, then an “economics-inspired” configuration of the system is likely to result in poor
performance.

There are many other scientific fields which are currently under investigation from a 
COIN-design perspective. Some of them are, like economics, part of (or at least closely 
related to) the social sciences. These fields typically involve RL algorithms under the 
guise of human agents. (An example is game theory, especially game theory of boundedly
rational players.)

Of course, real-world economies are “emergent” and don’t have externally imposed 
world utilities, like time-average of GDP. Rather such utilities are an analytic tool that 
an understanding of COINs would exploit to gain insight into the functioning of human 
economies. There are other scientific fields that might benefit from a COIN-design 
perspective even though they study systems that don’t even involve RL algorithms. 
The idea here is that if we viewed such systems from a teleological perspective, both in 
concentrating on a world utility and in casting the nodal elements of the system as RL 
algorithms, we could learn a lot about the form of the ‘design space’ in which such systems
live. Examples here are ecosystems (individual genes, individuals, or species being the 
nodal elements) and cells (individual organelles in Eukaryotes being the nodal elements). 
In both cases, the world utility could involve robustness of the desired equilibrium against 
external perturbation, efficient exploitation of free energy in the environment, etc. 


3 Review of Literature Related to COINs 

There are many different features that characterize what we mean by a “COIN”. The 
first four features in the following list are definitional; the remainder are not definitional 
per se, but are fundamental to the sorts of COINs we are concerned with in this chapter. 

1) There are many processors running concurrently, performing actions that affect 
one another’s behavior. 

2) There is little to no centralized personalized communication, i.e., little to no behav- 
ior in which a small subset of the processors communicates with all the other processors, 
but communicates differently with each one of those other processors. Any single pro- 
cessor’s “broadcasting” the same information to all other processors is not precluded. 

3) There is little to no centralized personalized control, i.e., little to no behavior in 
which a small subset of the processors controls all the other processors, but controls each 
one of those other processors differently. “Broadcasting” the same control signal to all 
other processors is not precluded. 

4) There is a well-specified task, typically in the form of extremizing a utility function, 
that concerns the behavior of the entire distributed system. So we are confronted with 
the inverse problem of how to configure the system to achieve the task. 

5) The individual processors are running RL algorithms.

6) The approach for tackling (4) is scalable to very large numbers of processors. 

7) The approach for tackling (4) is very broadly applicable. In particular, it can work 
when little (if any) “broadcasting” as in (2) and (3) is possible. 

8) The approach for tackling (4) involves little to no hand-tailoring. 

9) The approach for tackling (4) is robust and adaptive, with minimal need to “get
the details exactly right or else”, as far as the stochastic dynamics of the system is
concerned.

The rest of this section reviews some of the fields that are related to COINs, and in 
particular characterizes them in terms of this list of nine characteristics of COINs. 

3.1 AI and Machine Learning 

3.1.1 Reinforcement Learning 

As discussed in the introduction, the maturing field of reinforcement learning provides a 
much needed tool for the types of problems addressed by COINs. Because RL generally 
provides model-free¹ and “online” learning features, it is ideally suited for the distributed
environment where a “teacher” is not available and the agents need to learn successful 
strategies based on “rewards” and “penalties” they receive from the overall system at 
various intervals. It is even possible for the learners to use those rewards to modify how 
they learn [207]. 

Although work on RL dates back to Samuel’s checker player [199], relatively recent 
theoretical [243] and empirical results [57, 235] have made RL one of the ‘hottest’ 
areas in machine learning. Many problems ranging from controlling a robot’s gait to
controlling a chemical plant to allocating constrained resources have been addressed with
considerable success using RL [100, 119, 172, 190, 261]. In particular the RL algorithms
TD(λ) (which rates potential states based on a value function) [230] and Q-learning
(which rates action-state pairs) [243] have been investigated extensively. A detailed
investigation of RL is available in [127, 231, 243].
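
As a minimal illustration of how Q-learning rates action-state pairs, the following sketch applies the standard tabular update to a toy problem. The five-state chain environment, the parameter values, and all function and variable names are assumptions made purely for illustration; they are not taken from the works cited above.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states 0..n_states-1.

    Actions: 0 = move left (blocked at state 0), 1 = move right.
    Reward is 1 on entering the last state, which ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action choice; ties are broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # The goal row is never updated, so its maximum stays at zero.
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q

if __name__ == "__main__":
    for state, (left, right) in enumerate(q_learning_chain()):
        print(state, round(left, 3), round(right, 3))
```

In this toy setting the learned values for moving right end up above those for moving left in every non-terminal state, so the greedy policy heads for the rewarded state.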

Although powerful and widely applicable, solitary RL algorithms will not perform
well on large distributed heterogeneous problems in general. This is due to the sheer
size of the action-policy space for such problems. In addition, without centralized
communication and control, how a solitary RL algorithm could run the full system at all,
poorly or well, becomes a major concern.² For these reasons, it is natural to consider
deploying many RL algorithms rather than a single one for these large distributed
problems. We will discuss the coordination issues such an approach raises in conjunction with
multi-agent systems in Section 3.1.3 and with learnability in COINs in Section 4.

¹ There exist some model-based variants of traditional RL. See for example [8].

² One possible solution would be to run the RL off-line on a simulation of the full system and then
convey the results to the components of the system at the price of a single centralized initialization (e.g.,
[171]). In general though, this approach will suffer from being extremely dependent on “getting the
details right” in the simulation.

3.1.2 Distributed Artificial Intelligence 

The field of Distributed Artificial Intelligence (DAI) has arisen as more and more tra- 
ditional Artificial Intelligence (AI) tasks have migrated toward parallel implementation. 
The most direct approach to such implementations is to directly parallelize AI production 
systems or the underlying programming languages [84, 196]. An alternative and more 
challenging approach is to use distributed computing, where not only are the individual 
reasoning, planning and scheduling AI tasks parallelized, but there are different modules 
with different such tasks, concurrently working toward a common goal [123, 124, 146]. 

In a DAI, one needs to ensure that the task has been modularized in a way that im- 
proves efficiency. Unfortunately, this usually requires a central controller whose purpose 
is to allocate tasks and process the associated results. Moreover, designing that con- 
troller in a traditional AI fashion often results in brittle solutions. Accordingly, recently 
there has been a move toward both more autonomous modules and fewer restrictions on 
the interactions among the modules [202]. 

Despite this evolution, DAI maintains the traditional AI concern with a pre-fixed set
of particular aspects of intelligent behavior (e.g., reasoning, understanding, learning, etc.)
rather than with their cumulative character. As the idea that intelligence may have more
to do with the interaction among components started to take shape [42, 43], focus shifted
to concepts (e.g., multi-agent systems) that better incorporated that idea [125].

3.1.3 Multi-Agent Systems 

The field of Multi-Agent Systems (MAS) is concerned with the interactions among the
members of a set of agents [40, 96, 125, 211, 232], as well as the inner workings
of each agent in such a set (e.g., their learning algorithms) [36, 37, 38]. As in computational
ecologies and computational markets (see below), a well-designed MAS is one
that achieves a global task through the actions of its components. The associated design
steps involve [125]:

1. Decomposing a global task into distributable subcomponents, yielding tractable 
tasks for each agent; 

2. Establishing communication channels that provide sufficient information to each 
of the agents for it to achieve its task, but are not too unwieldy for the overall
system to sustain; and 

3. Coordinating the agents in a way that ensures that they cooperate on the global 
task, or at the very least does not allow them to pursue conflicting strategies in 
trying to achieve their tasks. 

Step (3) is rarely trivial; one of the main difficulties encountered in MAS design is
that agents act selfishly and artificial cooperation structures have to be imposed on 
their behavior to enforce cooperation [11]. An active area of research is to determine 
how selfish agents’ “incentives” have to be engineered in order to avoid the TOC [216]. 
When simply providing the right incentives is not sufficient, one can resort to strategies 
that actively induce agents to cooperate rather than act selfishly. In such cases coor- 
dination [212], negotiations [137], coalition formation [201, 203, 263] or contracting [2] 
among agents may be needed to ensure that they do not work at cross purposes. 

Unfortunately, all of these approaches share with DAI and its offshoots the problem 
of relying excessively on hand-tailoring, and therefore being difficult to scale and often 
non-robust. In addition, except as noted in the next subsection, they involve no RL. 

3.1.4 Reinforcement Learning-Based Multi-Agent Systems

Because it neither requires explicit modeling of the environment nor having a “teacher” 
that provides the “correct” actions, the approach of having the individual agents in a 
MAS use RL is well-suited for MAS’s deployed in domains where one has little knowledge 
about the environment and/or other agents. There are two main approaches to designing 
such MAS’s: 

(i) One has ‘solipsistic agents’ which don’t know about each other and whose RL rewards 
are given by the performance of the entire system (so the joint actions of all other agents 
form an “inanimate background” contributing to the reward signal each agent receives); 

(ii) One has ‘social agents’ that explicitly model each other and take each others’ actions 
into account. 

Both (i) and (ii) can be viewed as ways to (try to) coordinate the agents in a MAS in a 
robust fashion. 

Solipsistic Agents: MAS’s with solipsistic agents have been successfully applied to a 
multitude of problems [57, 99, 109, 200, 206]. Generally these schemes use RL algorithms 
similar to those discussed in Section 3.1.1. However much of this work lacks a well defined 
global task or broad applicability (e.g., [200]). More generally, none of the work with 
solipsistic agents scales well. The problem is that each agent must be able to discern 
the effect of its actions on the overall performance of the system, since that performance 
constitutes its reward signal. As the number of agents increases though, the effects of 
any one agent’s actions (signal) will be swamped by the effects of other agents (noise), 
making the agent unable to learn well, if at all. (See the discussion below on learnability.)
In addition, of course, solipsistic agents cannot be used in situations lacking centralized 
calculation and broadcast of the single global reward signal. 
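
The signal-to-noise argument above can be illustrated with a small simulation. The sketch below is only a toy: it assumes that every agent picks an action of -1 or +1 uniformly at random and that the global reward is simply the sum of all actions. Neither assumption comes from the cited work, and the function name is hypothetical. Under these assumptions the correlation between one agent's action and the global reward is 1/sqrt(N), so it shrinks as the number of agents N grows.

```python
import math
import random

def action_reward_correlation(n_agents, trials=4000, seed=0):
    """Estimate the correlation between agent 0's action and the global reward.

    Assumed toy model: each agent plays -1 or +1 uniformly at random and the
    global reward is the sum of all agents' actions.
    """
    rng = random.Random(seed)
    own, reward = [], []
    for _ in range(trials):
        actions = [rng.choice((-1, 1)) for _ in range(n_agents)]
        own.append(actions[0])
        reward.append(sum(actions))
    mean_own = sum(own) / trials
    mean_reward = sum(reward) / trials
    cov = sum((x - mean_own) * (g - mean_reward) for x, g in zip(own, reward))
    var_own = sum((x - mean_own) ** 2 for x in own)
    var_reward = sum((g - mean_reward) ** 2 for g in reward)
    return cov / math.sqrt(var_own * var_reward)

if __name__ == "__main__":
    for n in (4, 25, 100):
        print(n, round(action_reward_correlation(n), 3), round(1 / math.sqrt(n), 3))
```

An agent that receives only the global reward must therefore extract an ever weaker signal about its own action as the system grows, which is the scaling difficulty described above.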

Social agents: MAS’s whose agents take the actions of other agents into account syn- 
thesize RL with game theoretic concepts (e.g., Nash equilibrium). They do this to try 
to ensure that the overall system both moves toward achieving the overall global goal 
and avoids oscillatory behavior [56, 88, 117, 118]. To that end, the agents incorporate 
internal mechanisms that actively model the behavior of other agents. In Section 3.2.5
we discuss a situation where such modeling is necessarily self-defeating. More gener- 
ally, this approach suffers from being narrowly applicable, requiring hand tailoring, and 
potentially not scaling well. 

3.2 Social Science-Inspired Systems 

Economics provides more than examples of naturally occurring systems that can be 
viewed as a (more or less) well-performing COIN. Both empirical economics (e.g., eco- 
nomic history, experimental economics) and theoretical economics (e.g., general equilib- 
rium theory [3], theory of optimal taxation [165]), provide a rich literature on how to 
study strategic situations where many parties interact. 

In this section we summarize the two economics concepts that are probably the most 
closely related to COINs, in that they deal with how a large number of interacting agents 
can function in a stable and efficient manner: general equilibrium theory and mechanism 
design. We then discuss general attempts to apply those concepts to distributed compu- 
tational problems. We follow this with a discussion of game theory, and then present a 
particular celebrated toy-world problem that involves many of these issues. 

3.2.1 General Equilibrium Theory

Often the first version of “equilibrium” that one encounters in economics is that of 
supply and demand in single markets: the price of the market’s good is determined by 
where the supply and demand curves for that good intersect. In cases where there is 
interaction among multiple markets however, one cannot simply determine the price of 
each market’s good individually, as both the supply and demand for each good depend
on the supply/demand of other goods. Considering the price fluctuations across markets
leads to the concept of ‘general equilibrium’, where prices for each good are determined
in such a way as to ensure that all markets ‘clear’ [3, 222]. Intuitively, this means that
prices are set so the total supply of each good is equal to the demand for that good.³ The
existence of such an equilibrium, proven in [3], was first postulated by Leon Walras [242].
A mechanism that calculates the equilibrium (i.e., market-clearing) prices now bears his
name: the Walrasian auctioneer.

³ More formally, each agent’s utility is a function of that agent’s allotment of all the possible goods.
In addition, every good has a price. (Utility functions are independent of money.) Therefore, for any set
of prices for the goods, every agent has a ‘budget’, given by their initial allotment of goods. We pool
all the agents’ goods together. Then we set prices for all of those goods, and allocate the goods back 
among the agents in such a way that each agent is given a total value of goods (as determined by the 
prices) equal to that agent’s budget (as determined by the prices and by that agent’s initial allotment). 
‘Markets clear’ at those prices for which all the initial goods are reallocated back among the agents (no 
“excess supply”) and for which each agent views its allocation of goods as optimizing its utility, subject 
to its budget and to those prices for the goods (no “excess demand”). 
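
The market-clearing condition can be made concrete with a small two-good exchange economy. The sketch below is an illustration under stated assumptions, not a description of any cited method: agents are assumed to have Cobb-Douglas utilities (an agent with weight a spends the fraction a of its budget on good 1), the price of good 2 is fixed at 1, and a simple bisection search stands in for the price-setting mechanism. All names and numbers are hypothetical.

```python
def excess_demand_good1(p1, agents):
    """Total demand for good 1 minus total supply, at price p1 (price of good 2 is 1).

    Each agent is (a, (e1, e2)): Cobb-Douglas weight a on good 1 and an
    initial allotment of e1 units of good 1 and e2 units of good 2.
    """
    total = 0.0
    for a, (e1, e2) in agents:
        budget = p1 * e1 + e2           # value of the initial allotment
        total += a * budget / p1 - e1   # utility-maximizing demand minus allotment
    return total

def clearing_price(agents, lo=1e-6, hi=1e6, iters=200):
    """Bisection on the price of good 1; excess demand falls as p1 rises."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if excess_demand_good1(mid, agents) > 0:
            lo = mid    # good 1 is over-demanded, so raise its price
        else:
            hi = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    agents = [(0.25, (2.0, 0.0)), (0.75, (0.0, 3.0))]
    p1 = clearing_price(agents)
    print(p1, excess_demand_good1(p1, agents))
```

At the price found, every agent's chosen bundle fits its budget and total demand for good 1 matches total supply; because every agent spends its whole budget, the market for good 2 then clears as well.
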

As a model of real-world interactions between agents, general equilibrium theory
suffers from having no temporal aspect (i.e., no dynamics) and from assuming that all
the agents are perfectly rational. Another shortcoming of general equilibrium theory is
that it does not readily accommodate the concept of money [87]. Of the three main roles
money plays in an economy (medium of exchange in trades, store of value for future
trades, and unit of account) none are essential in a general equilibrium setting. The
unit of account aspect is not needed as the bookkeeping is performed by the Walrasian
auctioneer. Since the supplies and demands are matched directly there is no need to
facilitate trades, and thus no role for money as a medium of exchange. And finally, as
the system reaches an equilibrium in one step, through the auctioneer, there is no need
to store value for future trading rounds [150].

The reason that money is not needed can be traced to the fact that there is an 
“overseer” with global information who guides the system. If we remove the centralized 
communication and control exerted by this overseer, then (as in a real economy) agents 
will no longer know the exact details of the overall economy. They will be forced to
make guesses as in any learning system, and the differences in those guesses will lead
to differences in their actions [140, 141].

Such a decentralized learning-based system more closely resembles a COIN than does
a general equilibrium system. In contrast to general equilibrium systems, the three main
roles money plays in a human economy are crucial to the dynamics of such a decentralized
system [12]. This comports with the important effects in COINs of having the agents’
utility functions involve money (see Background section above).

3.2.2 Mechanism Design 

The field of mechanism design encompasses auctions, monopoly pricing, optimal taxation 
and public good theory [138]. It is concerned with the incentives that must be applied 
to any set of agents that interact and exchange goods [165, 238] in order to get those 
agents to exhibit desired behavior. Usually that desired behavior concerns pre-specified 
utility functions of some sort for each of the individual agents. In particular, mechanism
design is concerned with ‘efficient’ incentive schemes which ensure that all bidders in an
auction “benefit” from the outcome, and ‘optimal’ incentive schemes which maximize a
preset global utility, which for real-world markets is a monotonically increasing function
in all its arguments.

One particularly important type of such incentive schemes is auctions. When many 
agents interact in a common environment often there needs to be a structure that sup- 
ports the exchange of goods or information among those agents. Auctions provide one 
such (centralized) structure for managing exchanges of goods. For example, in the En- 
glish auction all the agents come together and ‘bid’ for a good, and the price of the 
good is increased until only one bidder remains, who gets the good in exchange for the 
resource bid. As another example, in the Dutch auction the price of a good is decreased 
until one buyer is willing to pay the current price.

All auctions perform the same task: match supply and demand. As such, auctions
are one of the ways in which price equilibration among a set of interacting agents (perhaps
an equilibration approximating general equilibrium, perhaps not) can be achieved.
However, an efficient auction mechanism does not necessarily maximize the global utility.
For example, a transaction in an English auction is efficient in that both the seller and
the buyer benefit from it. However, the fact that the winner may well have been willing to pay
more for the good can confound the goal of the market designer.
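
The two auction formats, and the gap between what the English-auction winner pays and what that winner would have been willing to pay, can be sketched as follows. This is a toy model under explicit assumptions: valuations are private integers, prices move in fixed integer steps, at least one bidder values the good at or above the starting price of the English auction, and bidders are naive (they stay in, or accept, whenever the price does not exceed their valuation). The function names are hypothetical.

```python
def english_auction(valuations, start=0, step=1):
    """Ascending price: bidders drop out once the price exceeds their valuation.

    Returns (winner index, price paid). Assumes some valuation is >= start.
    """
    price = start
    active = [i for i, v in enumerate(valuations) if v >= price]
    while len(active) > 1:
        price += step
        still_in = [i for i in active if valuations[i] >= price]
        if not still_in:
            # All remaining bidders dropped out together: sell at the last price.
            return active[0], price - step
        active = still_in
    return active[0], price

def dutch_auction(valuations, start, step=1):
    """Descending price: the first bidder whose valuation meets the price wins.

    Returns (winner index, price paid), or (None, 0) if nobody accepts.
    """
    price = start
    while price > 0:
        takers = [i for i, v in enumerate(valuations) if v >= price]
        if takers:
            return takers[0], price
        price -= step
    return None, 0

if __name__ == "__main__":
    valuations = [3, 7, 10]
    print(english_auction(valuations))          # price stops just above 7
    print(dutch_auction(valuations, start=15))  # naive winner accepts at 10
```

With valuations 3, 7 and 10, the English auction ends as soon as the second-highest bidder drops out, so the winner pays less than the 10 it was willing to pay. A market designer whose goal is seller revenue would see that difference as lost value, even though both parties benefit from the sale.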

3.2.3 Computational Economics 

‘Computational economies’ are economics-inspired schemes for managing the compo- 
nents of a distributed computational system, which work by having a ‘computational 
market’ guide the interactions among those components. Such a market is defined as 
any structure that allows the components of the system to exchange information on rela- 
tive valuation of resources (as in an auction), establish equilibrium states (e.g., determine 
market clearing prices) and exchange resources (i.e., engage in trades). 

Such computational economies can be used to investigate real economies and biological
systems [30, 34, 35, 132]. They can also be used to design distributed computational
systems. For example, such computational economies are well-suited to many distributed 
resource allocation problems, where each component of the system can either directly 
produce the “goods” it needs or acquire them through trades with other components. 
Computational markets often allow for far more heterogeneity in the components than 
do conventional resource allocation schemes. Furthermore, there is both theoretical and 
empirical evidence suggesting that such markets are often able to settle to equilibrium 
states. For example, auctions find prices that satisfy both the seller and the buyer, which
results in an increase in the utility of both (else one or the other would not have agreed 
to the sale). Assuming that all parties are free to pursue trading opportunities, such 
mechanisms move the system to a point where all possible bilateral trades that could 
improve the utility of both parties are exhausted. Such a state of the system where 
any change that increases the utility of one agent must decrease the utility of another 
is called ‘Pareto optimal’ [90, 89]. Pareto optimality is particularly useful in systems 
where a global utility function is difficult to calculate or is not available. In such cases, 
trading ceases when all trades that may be beneficial are exhausted, leading to a Pareto 
optimal allocation. Note, however, that such allocations are not necessarily unique, and
a (potentially externally imposed) global utility may still be needed to select among 
different such allocations. 
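
As a minimal sketch of the notion of Pareto optimality used above, the following Python fragment filters a small set of candidate allocations down to those that no other candidate dominates. The utility numbers, the function names, and the final tie-breaking rule (maximize the worst-off agent's utility) are all hypothetical choices made for illustration; they are not taken from any of the cited systems.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True if u Pareto-dominates v: no agent is worse off, at least one is better off."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_optimal(candidates: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the candidates that are not dominated by any other candidate."""
    return [u for u in candidates if not any(dominates(v, u) for v in candidates)]


# Hypothetical utilities (agent 1, agent 2) for four candidate allocations.
allocations = [(3, 1), (1, 3), (2, 2), (1, 1)]
frontier = pareto_optimal(allocations)
print(frontier)  # [(3, 1), (1, 3), (2, 2)]

# An externally imposed global utility (assumed here: the minimum utility)
# selects one allocation among the several Pareto optimal ones.
chosen = max(frontier, key=min)
print(chosen)  # (2, 2)
```

The example has three Pareto optimal allocations, which illustrates why Pareto optimality alone does not single out an allocation.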

One example of such a computational economy being used for resource allocation is 
Huberman and Clearwater’s use of a double-blind auction to solve the complex task of 
controlling the temperature of a building. In this case, each agent (individual temper- 
ature controller) bids to buy or sell cool or warm air. This market mechanism leads to 
an equitable temperature distribution in the system [121]. Other domains where market
mechanisms were successfully applied include purchasing memory in an operating system
[52], allocating virtual circuits [80], “stealing” unused CPU cycles in a network of
computers [72, 240], predicting option futures in financial markets [189], and numerous 
scheduling and distributed resource allocation problems [139, 145, 217, 227, 246, 247]. 

Computational economics can also be used for tasks not tightly coupled to resource 
allocation. For example, following the work of Maes [154] and Ferber [79], Baum shows 
how by using computational markets a large number of agents can interact and cooperate 
to solve a variant of the blocks world problem [21, 22, 23].

Viewed as candidate COINs, all market-based computational economies fall short in
relying on both centralized communication and centralized control to some degree. Often
that reliance is extreme. For example, the systems investigated by Baum not only
have the centralized control of a market, but in addition have centralized control of all 
other non-market aspects of the system. (Indeed, the market is secondary, in that it is 
only used to decide which single expert among a set of candidate experts gets to exert 
that centralized control at any given moment). There has also been doubt cast on how 
well computational economies perform in practice [236], and they also often require 
extensive hand-tailoring in practice. 

3.2.4 Game Theory 

Game theory is concerned with situations where a set of players, each having a local 
utility function and set of actions (strategies), analyze strategies which maximize their 
own utilities [29, 90]. It is important to note that in this context, the global behavior 
arises as an “accident” of the individual players’ choices, in that the players do not 
attempt either directly (take actions to that end) or indirectly (take actions that allow 
other players to take actions to that end) to maximize the global utility. In fact, the 
concept of global utility is defined as a by-product of players’ utilities, rather than being
a desirable goal state in its own right.

In a game where each player analyzes the potential actions at a given time step, 
evaluates them on the basis of corresponding expected local utility and selects the most 
profitable strategy, it is important to study both convergence and equilibrium properties 
of the system [78, 214]. Although there are many types of equilibrium in a game, the 
most commonly used one was formalized by Nash [175]. 

In a Nash equilibrium, each player’s strategy is the optimal response to the other
players’ strategies. In other words, it is a state in the game where no player
can improve its utility by changing its actions unilaterally. One of the reasons that the
Nash equilibrium is crucial in the analysis of games is that it provides “consistent”
predictions, i.e., if all parties predict that the game will converge to a Nash equilibrium,
no one will benefit by changing strategies [90]. Note however, that a consistent
prediction does not ensure an equilibrium point where the local utilities are maximized.
The study of small perturbations around Nash equilibria from a stochastic dynamics
perspective presents insight on how to select an equilibrium state when more than one 
are present [156]. 

The strategies that each player has at its disposal are also referred to as pure strategies,
i.e., one takes a particular action every time one encounters a particular state. If, on the
other hand, a player chooses among different strategies in a probabilistic manner, we refer
to a mixed strategy. We now state one of the most fundamental results in game theory:
every finite game has a mixed strategy equilibrium, while it does not necessarily have 
a pure strategy equilibrium [90, 175]. The relevance of this result is in guaranteeing an 
equilibrium solution provided that agents are complex enough not to be restricted to 
always choose the same strategy in any one situation. 
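
The distinction between pure and mixed strategy equilibria can be made concrete with the game of matching pennies, which is used here purely as an illustrative example and is not discussed in the text above. The sketch below (all names are hypothetical) enumerates the pure strategy profiles to confirm that none is a Nash equilibrium, and then checks that against a 50/50 opponent no deviation changes the row player's expected payoff, which is the defining property of the mixed equilibrium.

```python
from itertools import product

ACTIONS = ("H", "T")


def payoff(row: str, col: str) -> tuple:
    """Zero-sum payoffs: the row player wins 1 if the coins match, else loses 1."""
    r = 1 if row == col else -1
    return (r, -r)


def is_pure_nash(row: str, col: str) -> bool:
    """True if neither player can gain by unilaterally switching action."""
    r, c = payoff(row, col)
    row_ok = all(payoff(alt, col)[0] <= r for alt in ACTIONS)
    col_ok = all(payoff(row, alt)[1] <= c for alt in ACTIONS)
    return row_ok and col_ok


pure_equilibria = [p for p in product(ACTIONS, ACTIONS) if is_pure_nash(*p)]


def expected_row_payoff(p_row_heads: float, p_col_heads: float) -> float:
    """Expected payoff to the row player when each player plays H with the given probability."""
    total = 0.0
    for row, col in product(ACTIONS, ACTIONS):
        pr = p_row_heads if row == "H" else 1 - p_row_heads
        pc = p_col_heads if col == "H" else 1 - p_col_heads
        total += pr * pc * payoff(row, col)[0]
    return total


# Against a 50/50 column player, every row strategy earns the same expected payoff.
deviations = [expected_row_payoff(p, 0.5) for p in (0.0, 0.25, 0.5, 1.0)]
print(pure_equilibria)  # []
print(deviations)       # [0.0, 0.0, 0.0, 0.0]
```

Because the game is zero-sum and symmetric in the two actions, the same check applies to the column player, so both players mixing 50/50 is an equilibrium even though no pure strategy profile is.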

When agents play a game repeatedly, one can study the agents’ performance over
time and evaluate strategies accordingly. If agents learn to modify their strategies, one 
refers to repeated games with memory. If on the other hand agents have a fixed set of 
strategies but are either removed from the game or allowed to multiply according to their 
accumulated reward, one refers to evolutionary game theory. Within this framework one 
can study the long term effects of strategies such as cooperation and see if they arise
naturally and if so, under what circumstances [10, 16, 25, 73, 130, 181]; investigate
the dependence of evolving strategies on the amount of information available to the
agents [161]; study the effect of communication on the evolution of cooperation [162, 164]; 
and draw parallels with auctions and economic theory [110, 163]. 

3.2.5 El Farol Bar Problem 

The “El Farol” bar problem and its variants provide a clean and simple testbed for 
investigating certain kinds of interactions among agents [4, 49, 213]. In the original 
version of the problem, which arose in economics, at each time step (each “night”), each 
agent needs to decide whether to attend a bar. The goal of the agent in making this 
decision depends on the total attendance at the bar on that night. If the total attendance 
is below a preset capacity then the agent should have attended. Conversely, if the bar 
is overcrowded on the given night, then the agent should not attend. (Because of this 
structure, the bar problem with capacity set to 50% of the total number of agents is also 
known as the ‘minority game’; each agent selects one of two groups at each time step, 
and those that are in the minority have made the right choice). The agents make their 
choices by predicting ahead of time whether the attendance on the current night will 
exceed the capacity and then taking the appropriate course of action. 

What makes this problem particularly interesting is that it is impossible for all agents 
to be perfectly rational in the sense of all correctly predicting the attendance on any 
given night. This is because if most agents predict that the attendance will be low (and 
therefore decide to attend), the attendance will actually be high, while if they predict the
attendance will be high (and therefore decide not to attend) the attendance will be low.
(In the language of game theory, this essentially amounts to the property that there are
no pure strategy Nash equilibria [51, 259].) Alternatively, viewing the overall system
as a COIN, it has a Prisoner’s Dilemma-like nature, in that rational behavior by all the
individual agents thwarts the global goal of maximizing total enjoyment (defined as the 
sum of all agents’ enjoyment and maximized when the bar is exactly at capacity). 
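
The self-defeating character of shared predictions can be seen in a few lines of Python. The sketch below assumes, purely for illustration, 100 agents, a capacity of 60, and a single predictor common to all agents ("tonight's attendance will equal last night's"); none of these choices comes from the cited work, and attendance at or below capacity is treated as uncrowded.

```python
N_AGENTS = 100
CAPACITY = 60  # illustrative value, an assumption of this sketch


def homogeneous_nights(n_nights: int, first_attendance: int) -> list:
    """Attendance when every agent predicts tonight will look like last night."""
    history = [first_attendance]
    for _ in range(n_nights):
        predicted = history[-1]
        # All agents reason identically: attend only if no crowding is predicted.
        attendance = N_AGENTS if predicted <= CAPACITY else 0
        history.append(attendance)
    return history[1:]


nights = homogeneous_nights(6, first_attendance=40)
print(nights)  # [100, 0, 100, 0, 100, 0]

# Each night's shared prediction (crowded or not) is contradicted by the outcome.
previous = [40] + nights[:-1]
always_wrong = all((p <= CAPACITY) != (a <= CAPACITY) for p, a in zip(previous, nights))
print(always_wrong)  # True
```

The bar is never close to capacity in this run, so total enjoyment stays far from its maximum; avoiding this requires the agents to differ in their predictors or to randomize.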

This frustration effect is similar to what occurs in spin glasses in physics, and makes 
the bar problem closely related to the physics of emergent behavior in distributed sys- 
tems [48, 49, 50, 262]. Researchers have also studied the dynamics of the bar problem 
to investigate economic properties like competition, cooperation and collective behavior 
and especially their relationship to market efficiency [61, 126, 205]. 

3.3 Biologically Inspired Systems 

Properly speaking, biological systems do not involve utility functions and searches across 
them with RL algorithms. However it has long been appreciated that there are many 
ways in which viewing biological systems as involving searches over such functions can 
lead to deeper understanding of them [210, 257]. Conversely, some have argued that the 
mechanism underlying biological systems can be used to help design search algorithms 
[111]. 4

These kinds of reasoning which relate utility functions and biological systems have
traditionally focussed on the case of a single biological system operating in some external
environment. If we extend this kind of reasoning to a set of biological systems that
are co-evolving with one another, then we have essentially arrived at biologically-based
COINs. This section discusses how some of the previous work in the literature bears on this
relationship between COINs and biology.

3.3.1 Population Biology and Ecological Modeling 

The fields of population biology and ecological modeling are concerned with the large- 
scale “emergent” processes that govern the systems that consist of many (relatively) 
simple entities interacting with one another [24, 103]. As usually cast, the “simple en- 
tities” are members of one or more species, and the interactions are some mathematical 
abstraction of the process of natural selection as it occurs in biological systems (involving 
processes like genetic reproduction of various sorts, genotype-phenotype mappings,
inter- and intra-species competitions for resources, etc.). Population biology and ecological
modeling in this context address questions concerning the dynamics of the resultant
ecosystem, and in particular how its long-term behavior depends on the details of the
interactions between the constituent entities. Broadly construed, the paradigm of ecological
modeling can even be extended to study how natural selection and self-regulating
feedback create a stable planet-wide ecological environment: Gaia [147].

4 See [153, 252] though for some counter-arguments to the particular claims most commonly made in
this regard.

The underlying mathematical models of other fields can often be usefully modified to 
apply to the kinds of systems population biology is interested in [13]. Conversely, the 
underlying mathematical models of population biology and ecological modeling can be 
applied to other non-biological systems. In particular, those models shed light on social 
issues such as the emergence of language or culture, warfare, and economic competition 
[74, 75, 91]. They also can be used to investigate more abstract issues concerning the
behavior of large complex systems with many interacting components [92, 101, 157, 179, 
188]. 

Going a bit further afield, an approach that is related in spirit to ecological modeling is 
‘computational ecologies’. These are large distributed systems in which the components
each act (seemingly) independently, yet together produce complex global behavior. Those
components are viewed as constituting an “ecology” in an abstract sense (although much 
of the mathematics is not derived from the traditional field of ecological modeling). 
In particular, one can investigate how the dynamics of the ecology is influenced by 
the information available to each component and how cooperation and communication
among the components affect that dynamics [120, 122].

Although in some ways the most closely related to COINs of the current ecology- 
inspired research, the field of computational ecologies has some significant shortcomings 
if one tries to view it as a full science of COINs. In particular, it suffers from not 
being designed to solve the inverse problem of how to configure the system so as to 
arrive at a particular desired dynamics. This is a difficulty endemic to the general 
program of equating ecological modeling and population biology with the science of 
COINs. These fields are primarily concerned with the “forward problem” of determining 
the dynamics that arises from certain choices of the underlying system. Unless one’s 
desired dynamics is sufficiently close to some dynamics that was previously catalogued 
(during one’s investigation of the forward problem), one has very little information on 
how to set up the components and their interactions to achieve that desired dynamics. In 
addition, most of the work in these fields does not involve RL algorithms and, viewed as a
context in which to design COINs, suffers from a need for hand-tailoring and potentially
a lack of robustness and scalability.

3.3.2 Swarm Intelligence 

The field of ‘swarm intelligence’ is concerned with systems that are modeled after so- 
cial insect colonies, so that the different components of the system are queen, worker, 
soldier, etc. It can be viewed as ecological modeling in which the individual entities 
have extremely limited computing capacity and/or action sets, and in which there are 
very few types of entities. The premise of the field is that the rich behavior of social 
insect colonies arises not from the sophistication of any individual entity in the colony, 
but from the interaction among those entities. The objective of current research is to 
uncover kinds of interactions among the entity types that lead to pre-specified behavior
of some sort. 

More speculatively, the study of social insect colonies may also provide insight into 
how to achieve learning in large distributed systems. This is because at the level of the 
individual insect in a colony, very little (or no) learning takes place. However across 
evolutionary time-scales the social insect species as a whole functions as if the various 
individual types in a colony had “learned” their specific functions. The “learning” is the 
direct result of natural selection. (See the discussion on this topic in the subsection on 
ecological modeling.) 

Swarm intelligences have been used to adaptively allocate tasks in a mail com- 
pany [32], solve the traveling salesman problem [64, 65] and route data efficiently in 
dynamic networks [31, 208, 229] among others. Despite this, such intelligences do not 
really constitute a general approach to designing COINs. There is no general framework 
for adapting swarm intelligences to maximize particular world utility functions. Accord- 
ingly, such intelligences generally need to be hand-tailored for each application. And 
after such tailoring, it is often quite a stretch to view the system as “biological” in any 
sense, rather than just a simple and a priori reasonable modification of some previously 
deployed system. 

3.3.3 Artificial Life 

The two main objectives of Artificial Life, closely related to one another, are understanding
the abstract functioning and especially the origin of terrestrial life, and creating
organisms that can meaningfully be called “alive” [143]. 

The first objective involves formalizing and abstracting the mechanical processes un- 
derpinning terrestrial life. In particular, much of this work involves various degrees of 
abstraction of the process of self-replication [41, 219, 239]. Some of the more
real-world-oriented work on this topic involves investigating how lipids assemble into more complex
structures such as vesicles and membranes, which is one of the fundamental questions in the
origin of life [63, 70, 184, 187, 177]. Many computer models have been proposed to simulate
this process, though most suffer from overly simplifying the molecular morphology. 

More generally, work concerned with the origin of life can constitute an investigation of 
the functional self-organization that gives rise to life [158]. In this regard, an important 
early work on functional self-organization is the lambda calculus, which provides an 
elegant framework (recursively defined functions, lack of distinction between object and 
function, lack of architectural restrictions) for studying computational systems [55]. This 
framework can be used to develop an artificial chemistry “function gas” that displays 
complex cooperative properties [83]. 

The second objective of the field of Artificial Life is less concerned with understanding 
the details of terrestrial life per se than of using terrestrial life as inspiration for how to 
design living systems. For example, motivated by the existence (and persistence) of computer
viruses, several workers have tried to design an immune system for computers that
will develop “antibodies” and handle viruses both more rapidly and more efficiently than
other algorithms [85, 131, 220]. More generally, because we only have one sampling point
(life on Earth), it is very difficult to precisely formulate the process by which life emerged.
By creating an artificial world inside a computer, however, it is possible to study far more
general forms of life [192, 193, 194]. See also [253] where the argument is presented that 
the richest way of approaching the issue of defining “life” is phenomenologically, in terms 
of self-dissimilar scaling properties of the system. 

3.3.4 Training Cellular Automata with Genetic Algorithms

Cellular automata can be viewed as digital abstractions of physical gases [33, 77, 248, 
249]. Formally, they are discrete-time recurrent neural nets where the neurons live on a 
grid, each neuron has a finite number of potential states, and inter-neuron connections 
are (usually) purely local. (See below for a discussion of recurrent neural nets.) So the 
state update rule of each neuron is fixed and local, the next state of a neuron being a
function of its own current state and the current states of its neighboring neurons.
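
A fixed, local update rule of this kind can be sketched in Python for the simplest case of a one-dimensional binary automaton on a ring. Encoding the rule as an 8-bit integer follows the commonly used Wolfram numbering for elementary cellular automata, which is an assumption of this sketch and not something the text prescribes; rule 90 (each cell becomes the exclusive-or of its two neighbors) is picked only as an example.

```python
def step(cells: list, rule: int) -> list:
    """One synchronous update of a binary 1-D cellular automaton with periodic boundaries."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        # The three local states form an index 0..7 into the bits of the rule number.
        neighbourhood = (left << 2) | (centre << 1) | right
        nxt.append((rule >> neighbourhood) & 1)
    return nxt


cells = [0, 0, 0, 1, 0, 0, 0]
for _ in range(3):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells, rule=90)
# ...#...
# ..#.#..
# .#...#.
```

Searching over the 256 possible values of `rule` for one that produces a desired long-run configuration is a toy version of the inverse problem discussed in this subsection.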

The state update rule of (all the neurons making up) any particular cellular automaton 
specifies the mapping taking the initial configuration of the states of all of its neurons 
to the final, equilibrium (perhaps strange) attractor configuration of all those neurons. 
So consider the situation where we have a desired such mapping, and want to know an 
update rule that induces that mapping. This is a search problem, and can be viewed as 
similar to the inverse problem of how to design a COIN to achieve a pre-specified global 
goal, albeit a “COIN” whose nodal elements do not use RL algorithms. 

Genetic algorithms are a special kind of search algorithm, based on analogy with the 
biological process of natural selection via recombination and mutation of a genome [166]. 
There is no formal theory justifying genetic algorithms as search algorithms [152, 252] 
and very few empirical comparisons with other search techniques that might justify their 
use. Nonetheless, genetic algorithms (and ‘evolutionary computation’ in general) have 
been studied quite extensively. In particular, they have been used to (try to) solve 
the inverse problem of finding update rules for a cellular automaton that induce a pre- 
specified mapping from its initial configuration to its attractor configuration. To date, 
they have been used this way only for extremely simple configuration mappings, mappings
which can be trivially learned by other kinds of systems. Despite the simplicity of these 
mappings, the use of genetic algorithms to try to train cellular automata to exhibit them 
has achieved little success [168, 167, 58, 59]. 


3.4 Physics-Based Systems

3.4.1 Statistical Physics

Equilibrium statistical physics is concerned with the stable state character of large num- 
bers of very simple physical objects, interacting according to well-specified local deter- 
ministic laws, with probabilistic noise processes superimposed [5, 195]. Typically there is 
no sense in which such systems can be said to have centralized control, since all particles 
contribute comparably to the overall dynamics. 

Aside from mesoscopic statistical physics, the numbers of particles considered are 
usually on the order of $10^{23}$, and the particles themselves are extraordinarily simple,
typically having only a few degrees of freedom. Moreover, the noise processes usually
considered are highly restricted, being those that are formed by “baths” of heat, particles,
and the like. Similarly, almost all of the field restricts itself to deterministic laws
that are readily encapsulated in Hamilton’s equations (Schrödinger’s equation and its
field-theoretic variants for quantum statistical physics). In fact, much of equilibrium
statistical physics isn’t even concerned with the dynamic laws by themselves (as, for
example, is the study of stochastic Markov processes). Rather it is concerned with invariants of those
laws (e.g., energy), invariants that relate the states of all of the particles. Trivially then, 
deterministic laws without such readily-discoverable invariants are outside of the purview 
of much of statistical physics. 

One potential use of statistical physics for COINs involves taking the systems that 
statistical physics analyzes, especially those analyzed in its condensed matter variant 
(e.g., spin glasses [224, 225]), as simplified models of a class of COINs. This approach 
is used in some of the analysis of the Bar problem (see above). It is used more overtly 
in (for example) the work of Galam [93], in which the equilibrium coalitions of a set of 
“countries” are modeled in terms of spin glasses. This approach cannot provide a general 
COIN framework though. In addition to the caveats listed above, this is due to its not
providing a general solution to inverse problems and its lack of RL algorithms. 5 

Another contribution that statistical physics can make is with the mathematical tech- 
niques it has developed for its own purposes, like mean field theory, self-averaging ap- 
proximations, phase transitions, Monte Carlo techniques, the replica trick, and tools to 
analyze the thermodynamic limit in which the number of particles goes to infinity. Although
such techniques have not yet been applied to COINs, they have been successfully
applied to related fields. This is exemplified by the use of the replica trick to analyze 
two-player zero-sum games with random payoff matrices in the thermodynamic limit of 
the number of strategies in [26]. Other examples are the numeric investigation of iterated
prisoner’s dilemma played on a lattice [233], the analysis of stochastic games by
expressing deviation from rationality in the form of a “heat bath” [156], and the use
of topological entropy to quantify the complexity of a voting system studied in [159].

5 In regard to the latter point however, it’s interesting to speculate about recasting statistical physics
as a COIN, by having each of the particles in the physical system run an RL algorithm that perfectly
optimizes the “utility function” of its Lagrangian, given the “actions” of the other particles. In this
perspective, many-particle physical systems are multi-stage games that are at Nash equilibrium in each
stage.

Other work in the statistical physics literature is formally identical to that in other 
fields, but presents it from a novel perspective. A good example of this is [218] which 
analyzes the use of a single simple proportional RL algorithm for control of a spatially
extended system. (All without a single mention of the field of reinforcement learning.) 

3.4.2 Action Extremization 

Much of the theory of physics can be cast as solving for the extremization of an actional, 
which is a functional of the worldline of an entire (potentially many-component) sys- 
tem across all time. The solution to that extremization problem constitutes the actual 
worldline followed by the system. In this way the calculus of variations can be used to 
solve for the worldline of a dynamic system. As an example, simple Newtonian dynamics 
can be cast as solving for the worldline of the system that extremizes the time integral of a
quantity called “the Lagrangian”, which is a function of that worldline and of certain parameters (e.g.,
the “potential energy”) governing the system at hand. In this instance, the calculus of 
variations simply results in Newton’s laws. 
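
As a numerical sanity check of this idea, the sketch below discretizes the action of a single particle in uniform gravity, taking the action to be the time integral of kinetic minus potential energy, and compares the Newtonian trajectory against perturbed trajectories with the same endpoints. The physical constants, the number of time steps, and the sinusoidal form of the perturbation are all illustrative assumptions.

```python
import math

M, G, T, N = 1.0, 9.81, 1.0, 100  # mass, gravity, duration, time steps (illustrative)
DT = T / N


def action(path: list) -> float:
    """Discretized action: sum over time steps of (kinetic - potential) * dt."""
    total = 0.0
    for x0, x1 in zip(path, path[1:]):
        kinetic = 0.5 * M * ((x1 - x0) / DT) ** 2
        potential = M * G * 0.5 * (x0 + x1)
        total += (kinetic - potential) * DT
    return total


def newtonian(t: float) -> float:
    """Solution of m x'' = -m g with x(0) = x(T) = 0."""
    return 0.5 * G * t * (T - t)


times = [k * DT for k in range(N + 1)]
classical = [newtonian(t) for t in times]


def perturbed(eps: float) -> list:
    """Path with the same endpoints, bent away from the Newtonian one."""
    return [x + eps * math.sin(math.pi * t / T) for x, t in zip(classical, times)]


s_classical = action(classical)
s_perturbed = [action(perturbed(eps)) for eps in (-0.1, -0.01, 0.01, 0.1)]
print(s_classical)
print(all(s > s_classical for s in s_perturbed))  # True
```

For this particular system the extremum is a minimum, so every perturbed path has a larger action than the path obeying Newton's law; in general the extremum need not be a minimum.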

If we take the dynamic system to be a COIN, we are assured that its worldline 
automatically optimizes a “global goal” consisting of the value of the associated actional. 
If we change physical aspects of the system that determine the functional form of the 
actional (e.g., change the system’s potential energy function), then we change the global 
goal, and we are assured that our COIN optimizes that new global goal. 

The challenge in exploiting this to solve the inverse problem of how to design physical 
COINs is in translating an arbitrary provided global goal for the COIN into a parameter- 
ized actional. Note that that actional must govern the dynamics of the physical COIN, 
and the parameters of the actional must be physical variables in the COIN, variables 
whose values we can modify. 

3.4.3 Active Walker Models 

The field of active walker models [20, 104, 105] is concerned with modeling “walkers” (be
they human walkers or instead simple physical objects) crossing fields along trajectories, 
where those trajectories are a function of several factors, including in particular the 
trails already worn into the field. Often the kind of trajectories considered are those 
that can be cast as solutions to actional extremization problems so that the walkers can 
be explicitly viewed as agents optimizing a private utility. 

One of the primary concerns of the field of active walker models is how the trails
worn in the field change with time to reach a final equilibrium state. The problem 
of how to design the cement pathways in the field (and other physical features of the 
field) so that the final paths actually followed by the walkers will have certain desirable
characteristics is then one of solving for parameters of the actional that will result in the 
desired worldline. This is a special instance of the inverse problem of how to design a 
COIN. 

Using active walker models this way to design COINs, like action extremization in 
general, probably has limited applicability. Also, it is not clear how robust such a design 
approach might be, or whether it would be scalable and exempt from the need for hand- 
tailoring. 

3.5 Other Related Subjects 

This subsection presents a “catch-all” of other fields that have little in common with one 
another except that they bear some relation to COINs. 

3.5.1 Stochastic Fields 

An extremely well-researched body of work concerns the mathematical and numeric 
behavior of systems for which the probability distribution over possible future states 
conditioned on preceding states is explicitly provided. This work involves many aspects 
of Monte Carlo numerical algorithms [176], all of Markov chains [86, 180, 226], and
especially Markov fields, a topic that encompasses the Chapman-Kolmogorov equation
[94] and its variants: Liouville’s equation, the Fokker-Planck equation, and the
detailed-balance equation in particular. Non-linear dynamics is also related to this body of
work (see the synopsis of iterated function systems below and the synopsis of cellular
automata above), as are Markov competitive decision processes (see the synopsis of game
theory above).

Formally, one can cast the problem of designing a COIN as how to fix each of the 
conditional transition probability distributions of the individual elements of a stochastic 
field so that the aggregate behavior of the overall system is of a desired form. 6 Unfor- 
tunately, almost all that is known in this area instead concerns the forward problem, of 
inferring aggregate behavior from a provided set of conditional distributions. Although 
such knowledge provides many “bits and pieces” of information about how to tackle the 
inverse problem, those pieces collectively cover only a very small subset of the entire 
space of tasks we might want the COIN to perform. In particular, they tell us very little 
about the case where the conditional distribution encapsulates RL algorithms. 
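
The forward problem is easy to illustrate in the simplest setting of a single two-state Markov chain. In the sketch below, the transition probabilities are hypothetical numbers chosen for illustration; the code repeatedly applies the conditional distribution to infer the long-run aggregate behavior (the stationary distribution).

```python
def propagate(dist: list, transition: list) -> list:
    """One forward step: p'(j) = sum_i p(i) * P(next = j | current = i)."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]


# Hypothetical two-state element: transition[i][j] = P(next = j | current = i).
transition = [[0.9, 0.1],
              [0.5, 0.5]]

dist = [1.0, 0.0]
for _ in range(200):
    dist = propagate(dist, transition)
print(dist)  # approaches the stationary distribution (5/6, 1/6)
```

The inverse problem runs the other way: given a desired long-run distribution, choose the entries of `transition`. Even in this toy case the answer is not unique, since any transition matrix whose two off-diagonal entries are in the ratio 1 to 5 yields the same stationary distribution.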

6 In contrast, in the field of Markov decision processes, discussed in [47], the full system may be a 
Markov field, but the system designer only sets the conditional transition probability distribution of a 
few of the field elements at most, to the appropriate “decision rules”. Unfortunately, it is hard to imagine 
how to use the results of this field to design COINs because of major scaling problems. Any decision 
process must accurately model likely future modifications to its own behavior — often an extremely 
daunting task [153]. What’s worse, if multiple such decision processes are running concurrently in the 
system, each such process must also model the others, in their full complexity. 

3.5.2 Iterated Function Systems

The technique of iterated function systems [18, ?] grew out of the field of nonlinear 
dynamics [197, 234, 228, ?]. In such systems a function is repeatedly and recursively
applied to its own output. The most famous example is the logistic map, $x_{n+1} = r x_n (1 - x_n)$ for
some r between 0 and 4 (so that x stays between 0 and 1). More generally the function
along with its arguments can be vector-valued. In particular, we can construct such
functions out of affine transformations of points in a Euclidean plane. 
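
A minimal sketch of iterating the logistic map in Python follows. The parameter values and the starting point are illustrative choices; for r = 2.5 the iteration settles on the fixed point 1 - 1/r = 0.6, while for r = 4 the orbit wanders over the unit interval.

```python
def logistic_orbit(r: float, x0: float, n: int) -> list:
    """Return [x_0, x_1, ..., x_n] for the logistic map x_{k+1} = r * x_k * (1 - x_k)."""
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


settled = logistic_orbit(2.5, 0.2, 200)[-1]
print(settled)  # close to the fixed point 1 - 1/2.5 = 0.6

first_steps = logistic_orbit(4.0, 0.2, 2)
print(first_steps)  # approximately [0.2, 0.64, 0.9216]
```

The qualitative change in long-run behavior as the single parameter r varies is a small-scale version of the dependence of an attractor on system parameters discussed below.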

Iterated function systems have been applied to image data. In this case the successive
iteration of the function generically generates a fractal, one whose precise character
is determined by the initial iteration-1 image. Since fractals are ubiquitous in natural
images, a natural idea is to try to encode natural images as sets of iterated function
systems spread across the plane, thereby potentially garnering significant image
compression. The trick is to manage the inverse step of starting with the image to be
compressed, and determining what iteration-1 image(s) and iterating function(s) will
generate an accurate approximation of that image.

In the language of nonlinear dynamics, we have a dynamic system that consists of a 
set of iterating functions, together with a desired attractor (the image to be compressed). 
Our goal is to determine what values to set certain parameters of our dynamic system 
to so that the system will have that desired attractor. The potential relationship with 
COINs arises from this inverse nature of the problem tackled by iterated function sys- 
tems. If the goal for a COIN can be cast as its relaxing to a particular attractor, and 
if the distributed computational elements are isomorphic to iterated functions, then the 
tricks used in iterated functions theory could be of use. 

Although the techniques of iterated function systems might prove of use in designing 
COINs, they are unlikely to serve as a generally applicable approach to designing COINs. 
In addition, they do not involve RL algorithms, and often involve extensive hand-tuning. 

3.5.3 Recurrent Neural Nets 

A recurrent neural net consists of a finite set of “neurons” each of which has a real-valued
state at each moment in time. Each neuron’s state is updated at each moment in time
based on its current state and that of some of the other neurons in the system. The
topology of such dependencies constitutes the “inter-neuronal connections” of the net,
and the associated parameters are often called the “weights” of the net. The dynamics 
can be either discrete or continuous (i.e., given by difference or differential equations). 

Recurrent nets have been investigated for many purposes [46, 115, 95, 185, 260]. 
One of the more famous of these is associative memories. The idea is that given a pre-
specified pattern for the (states of the neurons in the) net, there may exist inter-neuronal
weights which result in a basin of attraction focussed on that pattern. If this is the case,
then the net is equivalent to an associative memory, in that a complete pre-specified
pattern across all neurons will emerge under the net’s dynamics from any initial pattern
that partially matches the full pre-specified pattern. In practice, one wishes the net to 
simultaneously possess many such pre-specified associative memories. There are many 
schemes for “training” a recurrent net to have this property, including schemes based on 
spin glasses [112, 113, 114] and schemes based on gradient descent [198]. 
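A minimal sketch of such an associative memory follows. It assumes binary (+1/-1) neuron states and uses the simple outer-product (Hebbian) rule to set the weights; that rule is our choice for illustration and is not necessarily the scheme used in the works cited above. The names `train_hebbian` and `recall` are likewise hypothetical.

```python
def train_hebbian(patterns):
    """Build symmetric weights from +1/-1 patterns with the outer-product rule."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights, state, max_sweeps=10):
    """Update neurons one at a time until a full sweep changes nothing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i, row in enumerate(weights):
            field = sum(w * s for w, s in zip(row, state))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


stored = [[1, 1, 1, 1, -1, -1, -1, -1],
          [1, -1, 1, -1, 1, -1, 1, -1]]
weights = train_hebbian(stored)

probe = list(stored[0])
probe[0] = -probe[0]  # corrupt one neuron of the first stored pattern
recovered = recall(weights, probe)
```

Here the corrupted probe lies in the basin of attraction of the first stored pattern, so the dynamics restore the full pattern from the partial match.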

As can the fields of cellular automata and iterated function systems, the field of 
recurrent neural nets can be viewed as concerning certain variants of COINs. Also like 
those other fields though, the field of recurrent neural nets has shortcomings if one tries to view it
as a general approach to a science of COINs. In particular, recurrent neural nets do 
not involve RL algorithms, and training them often suffers from scaling problems. More 
generally, in practice they can be hard to train well without hand-tailoring. 

3.5.4 Network Theory 

Packet routing in a data network [27, 116, 221, 241] presents a particularly interesting 
domain for the investigation of COINs. In particular, with such routing: 

(i) the problem is inherently distributed; 

(ii) for all but the most trivial networks it is impossible to employ global control;

(iii) the routers have access only to local information (routing tables);

(iv) it constitutes a relatively clean and easily modified 
experimental testbed; and 

(v) there are potentially major bottlenecks induced by ‘greedy’ 
behavior on the part of the individual routers, which behavior 
constitutes a readily investigated instance of the TOC. 

Many of the approaches to packet routing incorporate a variant on RL [39, 44, 53, 
149, 155]. Q-routing is perhaps the best known such approach and is based on routers 
using reinforcement learning to select the best path [39]. Although generally successful, 
Q-routing is not a general scheme for inverting a global task. This is even true if one 
restricts attention to the problem of routing in data networks — there exists a global 
task in such problems, but that task is directly used to construct the algorithm. 
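To make the idea of routers learning from purely local information concrete, here is a small sketch in the spirit of Q-routing. Each router keeps an estimate of delivery time per neighbor and nudges it toward the delay it observes plus the neighbor's own best estimate. The four-node network, the link delays, the learning rate, and the `Router` class are all hypothetical; exploration and queueing are omitted, with every link simply tried once per round.

```python
class Router:
    """One node holding its own table of delivery-time estimates."""

    def __init__(self, name, neighbors, destinations, initial_estimate=0.0):
        self.name = name
        self.neighbors = list(neighbors)
        # q[dest][neighbor]: estimated time to deliver to dest via neighbor.
        self.q = {d: {n: initial_estimate for n in self.neighbors}
                  for d in destinations}

    def best_estimate(self, dest):
        """Smallest estimated remaining delivery time from this router."""
        if dest == self.name:
            return 0.0
        return min(self.q[dest].values())

    def choose_next_hop(self, dest):
        """Greedy choice: the neighbor with the lowest current estimate."""
        return min(self.q[dest], key=self.q[dest].get)

    def update(self, dest, neighbor, queue_delay, link_delay,
               neighbor_estimate, learning_rate=0.5):
        """Move the estimate toward observed delay plus the neighbor's estimate."""
        target = queue_delay + link_delay + neighbor_estimate
        old = self.q[dest][neighbor]
        self.q[dest][neighbor] = old + learning_rate * (target - old)


routers = {
    "A": Router("A", ["B", "C"], ["D"]),
    "B": Router("B", ["D"], ["D"]),
    "C": Router("C", ["D"], ["D"]),
    "D": Router("D", [], ["D"]),
}
link_delay = {("A", "B"): 5.0, ("A", "C"): 1.0, ("B", "D"): 1.0, ("C", "D"): 1.0}

for _ in range(20):
    for sender, hop in link_delay:
        reported = routers[hop].best_estimate("D")
        routers[sender].update("D", hop, 0.0, link_delay[(sender, hop)], reported)

next_hop = routers["A"].choose_next_hop("D")
```

Note that each router optimizes only its own delivery-time estimates; nothing in the update refers to a world utility for the network as a whole.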

A particular version of the general packet routing problem that is acquiring increased 
attention is the Quality of Service (QoS) problem, where different communication pack- 
ets (voice, video, data) share the same bandwidth resource but have widely varying 
importances both to the user and (via revenue) to the bandwidth provider. Determining 
which packet has precedence over which other packets in such cases is not only based 
on priority in arrival time but more generally on the potential effects on the income of 
the bandwidth provider. In this context, RL algorithms have been used to determine 
routing policy, control call admission and maximize revenue by allocating the available
bandwidth efficiently [44, 155]. 

Many researchers have exploited the noncooperative game theoretic understanding 
of the TOC in order to explain the bottleneck character of empirical data network
behavior and suggest potential alternatives to current routing schemes [?, 69, 136, 142,
144, 182, 183, 215]. Closely related is work on various “pricing”-based resource allocation
strategies in congestable data networks [151]. This work is at least partially based upon 
current understanding of pricing in toll lanes, and traffic flow in general (see below). 
All of these approaches are particularly of interest when combined with the RL-based 
schemes mentioned just above. Due to these factors, much of the current research on 
a general framework for COINs is directed toward the packet-routing domain (see next 
section). 

3.5.5 Traffic Theory 

Traffic congestion typifies the TOC public good problem: everyone wants to use the 
same resource, and all parties greedily trying to optimize their use of that resource not 
only worsens global behavior, but also worsens their own private utility (e.g., if everyone 
disobeys traffic lights, everyone gets stuck in traffic jams). Indeed, in the well-known 
Braess’ paradox [19], keeping everything else constant — including the number and 
destinations of the drivers — but opening a new traffic path can increase everyone’s time 
to get to their destination. (Viewing the overall system as an instance of the Prisoner’s
dilemma, this paradox in essence arises through the creation of a novel ‘defect-defect’ 
option for the overall system.) Greedy behavior on the part of individuals also results 
in very rich global dynamic patterns, such as stop and go waves and clusters [106, 107]. 
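A small worked example makes the paradox concrete. The network and numbers below are hypothetical and chosen only for illustration: 4000 drivers travel from Start to End, two links take a fixed 45 minutes, and two congestible links take (number of drivers)/100 minutes. The function `route_times` is our own helper.

```python
N_DRIVERS = 4000


def route_times(n_upper, n_lower, n_shortcut):
    """Travel time in minutes on each route, given how many drivers use it.

    upper:    Start -> A -> End       (congestible link, then fixed 45)
    lower:    Start -> B -> End       (fixed 45, then congestible link)
    shortcut: Start -> A -> B -> End  (both congestible links, free A -> B)
    """
    start_a = (n_upper + n_shortcut) / 100.0  # congestible link Start -> A
    b_end = (n_lower + n_shortcut) / 100.0    # congestible link B -> End
    return {
        "upper": start_a + 45.0,
        "lower": 45.0 + b_end,
        "shortcut": start_a + 0.0 + b_end,
    }


# Shortcut closed: an even split is an equilibrium, both routes take 65 minutes.
before = route_times(2000, 2000, 0)

# Shortcut open: everyone using it is an equilibrium, because switching to
# either of the old routes would take even longer for the driver who switches.
after = route_times(0, 0, N_DRIVERS)
```

With the shortcut closed every driver takes 65 minutes. With it open, each driver's greedy choice leads to a state where every driver takes 80 minutes, and no individual can do better by deviating (the alternatives cost 85).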

Much of traffic theory employs and investigates tools that have previously been ap- 
plied in statistical physics [106, 133, 134, 186, 191] (see subsection above). In particular, 
the spontaneous formation of traffic jams provides a rich testbed for studying the emer- 
gence of complex activity from seemingly chaotic states [106, 108]. Furthermore, the 
dynamics of traffic flow is particularly amenable to the application and testing of many
novel numerical methods in a controlled environment [15, 28, 209]. Many experimental 
studies have confirmed the usefulness of applying insights gleaned from such work to real 
world traffic scenarios [106, 174, 173]. 

3.5.6 Topics from further afield 

Finally, there are a number of other fields that, while either still nascent or not extremely 
closely related to COINs, are of interest in COIN design: 

Amorphous computing: Amorphous computing grew out of the idea of replacing 
traditional computer design, with its requirements for high reliability of the components 
of the computer, with a novel approach in which widespread unreliability of those com- 
ponents would not interfere with the computation [1]. Some of its more speculative 
aspects are concerned with “how to program” a massively distributed, noisy system 
of components which may consist in part of biochemical and/or biomechanical components
[135, 245]. Work here has tended to focus on schemes for how to robustly induce
desired geometric dynamics across the physical body of the amorphous computer, issues
that are closely related to morphogenesis, and thereby lend credence to the idea that 
biochemical components are a promising approach. Especially in its limit of computers 
with very small constituent components, amorphous computing also is closely related to 
the fields of nanotechnology [66] and control of smart matter (see below). 

Control of smart matter: As the prospect of nanotechnology-driven mechanical
systems gets more concrete, the daunting problem of how to robustly control, power,
and sustain protean systems made up of extremely large sets of nano-scale devices looms
more important [98, 99, 109]. If this problem were to be solved one would in essence
have “smart matter”. For example, one would be able to “paint” an airplane wing with
such matter and have it improve drag and lift properties significantly. 

Morphogenesis: How does a leopard embryo get its spots, or a zebra embryo its 
stripes? More generally, what are the processes underlying morphogenesis, in which 
a body plan develops among a growing set of initially undifferentiated cells? These 
questions, related to control of the dynamics of chemical reaction waves, are essentially 
special cases of the more general question of how ontogeny works, of how the genotype- 
phenotype mapping is carried out in development. The answers involve homeobox (as 
well as many other) genes [17, 68, 128, 82, 237]. Under the presumption that the 
functioning of such genes is at least in part designed to facilitate genetic changes that
increase a species’ fitness, that functioning facilitates solution of the inverse problem, of 
finding small-scale changes (to DNA) that will result in “desired” large scale effects (to 
body plan) when propagated across a growing distributed system. 

Self-organizing systems: The concept of self-organization and self-organized criticality
[14] was originally developed to help understand why many distributed physical
systems are attracted to critical states that possess long-range dynamic correlations in
the large-scale characteristics of the system. It provides a powerful framework for analyzing
both biological and economic systems. For example, natural selection (particularly
punctuated equilibrium [71, 97]) can be likened to a self-organizing dynamical system,
and some have argued it shares many of the properties (e.g., scale invariance) of such
systems [60]. Similarly, one can view the economic order that results from the actions of
human agents as a case of self-organization [62]. The relationship between complexity
and self-organization is a particularly important one, in that it provides the potential 
laws that allow order to arise from chaos [129]. 

Small worlds (6 Degrees of Separation): In many distributed systems where each 
component can interact with a small number of “neighbors”, an important problem is how 
to propagate information across the system quickly and with minimal overhead. On the 
one extreme the neighborhood topology of such systems can exist on a completely regular 
grid-like structure. On the other, the topology can be totally random. In either case, 
certain nodes may be effectively ‘cut-off’ from other nodes if the information pathways 
between them are too long. Recent work has investigated “small worlds” networks
(sometimes called 6 degrees of separation) in which underlying grid-like topologies are
“doped” with a scattering of long-range, random connections. It turns out that very little 
such doping is necessary to allow for the system to effectively circumvent the information 
propagation problem [160, 244]. 
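The effect of such doping can be checked numerically. The sketch below builds a regular ring lattice, adds a handful of random long-range links, and compares mean shortest-path lengths. The lattice size, the number of shortcuts, and all function names are assumptions made for this example, not parameters from the cited work.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Each node is linked to its k nearest neighbors on either side of a ring."""
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            graph[i].add(j)
            graph[j].add(i)
    return graph


def add_shortcuts(graph, n_shortcuts, seed=0):
    """'Dope' the lattice with a few random long-range links (returns a copy)."""
    rng = random.Random(seed)
    doped = {node: set(nbrs) for node, nbrs in graph.items()}
    nodes = list(doped)
    added = 0
    while added < n_shortcuts:
        a, b = rng.sample(nodes, 2)
        if b not in doped[a]:
            doped[a].add(b)
            doped[b].add(a)
            added += 1
    return doped


def mean_path_length(graph):
    """Average shortest-path length over all ordered pairs, via BFS."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


regular = ring_lattice(200, 2)
doped = add_shortcuts(regular, 10)
length_regular = mean_path_length(regular)
length_doped = mean_path_length(doped)
```

Comparing `length_regular` with `length_doped` shows how much a few random links shorten the typical information pathway while leaving almost all of the local structure intact.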

Control theory: Adaptive control [6, 204], and in particular adaptive control involv- 
ing locally weighted RL algorithms [7, 170], constitute a broadly applicable framework 
for controlling small, potentially inexactly modeled systems. Augmented by techniques 
in the control of chaotic systems [54, ?], they constitute a very successful way of solving 
the “inverse problem” for such systems. Unfortunately, it is not clear how one could 
even attempt to scale such techniques up to the massively distributed systems of interest 
in COINs. The next section discusses in detail some of the underlying reasons why the 
purely model-based versions of these approaches are inappropriate as a framework for 
COINs. 

4 A FRAMEWORK DESIGNED FOR COINs 

Summarizing the discussion to this point, it is hard to see how any already extant 
scientific field can be modified to encompass systems meeting all of the requirements of 
COINs listed at the beginning of Section 3. This is not too surprising, since none of those 
fields were explicitly designed to analyze COINs. This section first motivates in general 
terms a framework that is explicitly designed for analyzing COINs. It then presents the 
formal nomenclature of that framework. This is followed by deriving some of the central 
theorems of that framework. Finally, we present experiments that illustrate the power 
the framework provides for ensuring large world utility in a COIN. 

Unfortunately, for reasons of space, the discussion here is abbreviated and laconic. 
A much more detailed discussion, including intuitive arguments, proofs and fully formal 
definitions of the concepts discussed in this section, can be found in [250].

4.1 Problems with a model-based approach 

What mathematics might one employ to understand and design COINs? Perhaps the 
most natural approach, related to the stochastic fields work reviewed above, involves the 
following three steps: 

1) First one constructs a complete stochastic model of the COIN’s dynamics, a model
parameterized by a vector $\theta$. As an example, $\theta$ could fix the utility functions of the
individual agents of the COIN, aspects of their RL algorithms, which agents communicate
with each other and how, etc.

2) Next we solve for the function $f(\theta)$ which maps the parameters of the model to
the resulting stochastic dynamics.

3) Cast our goal for the system as a whole as achieving a high expected value of some
“world utility”. Then as our final step we would have to solve the inverse problem: we
would have to search for a $\theta$ which, via $f$, results in a high value of $E(\text{world utility} \mid \theta)$.
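The three steps can be sketched on a deliberately tiny toy problem, far simpler than any real COIN. Everything here is hypothetical: a two-state ("good"/"bad") Markov chain whose transition probabilities depend on a scalar theta, a closed-form stationary distribution standing in for f, and a brute-force grid search standing in for the inverse step.

```python
def transition_probs(theta):
    """Step 1: toy model. Returns (P(good -> bad), P(bad -> good))."""
    return 0.1 + theta ** 2, theta


def stationary_good(theta):
    """Step 2: f(theta), here the long-run probability of the 'good' state."""
    to_bad, to_good = transition_probs(theta)
    return to_good / (to_bad + to_good)


def expected_world_utility(theta, utility_good=1.0, utility_bad=0.0):
    """E(world utility | theta) under the stationary distribution."""
    p_good = stationary_good(theta)
    return p_good * utility_good + (1.0 - p_good) * utility_bad


# Step 3: the inverse problem, solved here by brute-force search over a grid.
grid = [i / 100.0 for i in range(1, 91)]
best_theta = max(grid, key=expected_world_utility)
```

In this toy case f has a closed form and theta is one-dimensional, so the inverse step is trivial. The difficulties discussed next arise precisely because neither of those conveniences holds for a large COIN.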

Let’s examine in turn some of the challenges that each of these three steps entails:

I) We are primarily interested in very large, very complex systems, which are noisy, 
faulty, and often operate in a non-stationary environment. Moreover, our “very complex 
system” consists of many RL algorithms, all potentially quite complicated, all running 
simultaneously. Clearly coming up with a model that captures the dynamics of all of this 
in an accurate manner will often be extraordinarily difficult. Moreover, unfortunately, 
often the level of verisimilitude required of the model will be quite high. For example,
unless the modeling of the faulty aspects of the system were quite accurate, the model 
would likely be “brittle”, and overly sensitive to which elements of the COIN were and 
were not operating properly at any given time. 

II) Even for models much simpler than the ones called for in (I), solving explicitly for 
the function $f$ can be extremely difficult. For example, much of Markov Chain theory
is an attempt to broadly characterize such mappings. However as a practical matter, 
usually it can only produce potentially useful characterizations when the underlying 
models are quite inaccurate simplifications of the kinds of models produced in step (I). 

III) Even if one can write down an $f$, solving the associated inverse problem is often
impossible in practice. 

IV) In addition to these difficulties, there is a more general problem with the model-based
approach. We wish to perform our analysis on a “high level”. Our thesis is that
due to the robust and adaptive nature of the individual agents’ RL algorithms, there
will be very broad, easily identifiable regions of $\theta$ space all of which result in excellent
$E(\text{world utility} \mid \theta)$, and that these regions will not depend on the precise learning
algorithms used to achieve the low-level tasks (cf. the list at the beginning of Section 3). 
To fully capitalize on this one would want to be able to slot in and out different learning 
algorithms for achieving the low-level tasks without having to redo our entire analysis 
each time. However in general this would be possible with a model-based analysis only 
for very carefully designed models (if at all). The problem is that the result of step 
(3), the solution to the inverse problem, would have to concern aspects of the COIN 
that are (at least approximately) invariant with respect to the precise low-level learning 
algorithms used. Coming up with a model that has this property while still avoiding 
problems (I-III) is usually an extremely daunting challenge. 

Fortunately, there is an alternative approach which avoids modeling and its associated 
difficulties. We call any framework based on this alternative a descriptive framework. 
In such a framework one identifies certain salient characteristics of COINs, which are 
characteristics that one strongly expects to find in COINs that have large world utility. 
Under this expectation, one assumes that if a COIN is explicitly modified to have the 
salient characteristics, perhaps in response to observations of its run-time behavior, then 
its world utility will benefit. If those salient characteristics are (relatively) easy to induce
in a COIN, then this assumption provides a ready way to cause that COIN to have large
world utility. If in addition the salient characteristics can be induced with little or no 
modeling (e.g., via heuristics that aren’t rigorously and formally justified), then the 
descriptive framework can be used to improve world utility without recourse to detailed 
modeling. 

4.2 Nomenclature 

There exist many ways one might try to design a descriptive framework. In this subsec- 
tion we present nomenclature needed for a (very) cursory overview of one of them. (See 
[250] for a more detailed exposition, including formal proofs.) 

4.2.1 Preliminary Definitions 

1) We refer to an RL algorithm by which an individual component of the COIN modifies 
its behavior as a microlearning algorithm. We refer to the initial construction of the 
COIN, potentially based upon salient characteristics, as the COIN initialization. We 
use the phrase macrolearning to refer to externally imposed run-time modifications to 
the COIN which are based on statistical inference concerning salient characteristics of 
the running COIN. 

2) For convenience, we take time $t$ to be discrete and confined to the integers, $\mathbb{Z}$.
When referring to COIN initialization, we implicitly have a lower bound on $t$, which
without loss of generality we take to be < 0.

3) All variables that have any effect on the COIN are identified as components of 
Euclidean-vector-valued states of various discrete nodes. So for example, if our COIN 
consists in part of an “agent” running a set of microlearning algorithms, the precise
configuration of that agent at any time t, including all variables in its learning algorithm, 
all externally visible actions, internal parameters, values observed by its probes of the 
surrounding environment, etc., all constitute the state vector of a node representing that 
agent. We define $\zeta_{\eta,t} \in Z_{\eta,t}$ to be the Euclidean vector giving the state of node $\eta$ at time
$t$. The $i$’th component of that vector is indicated by $\zeta_{\eta,t;i}$.

Discussion: In practice, many COINs will involve variables that are most naturally 
viewed as discrete and symbolic. In such cases, we must exercise some care in how we 
choose to represent those variables as components of Euclidean vectors. There is nothing 
new in this; the same issue arises in modern work on applying neural nets to inherently 
symbolic problems. With COINs we will usually employ the same resolution of this issue 
employed in neural nets, namely representing the possible values of the discrete variable 
with a unary representation in a Euclidean space. Just as with neural nets, values of 
such vectors that do not lie on the vertices of the unit hypercube are not meaningful, 
strictly speaking. Fortunately though, just as with neural nets, there is almost always a 
most natural way to extend the definitions of any function of interest (like world utility)
so that it is well-defined even for vectors not lying on those vertices. This allows us to
meaningfully define partial derivatives of such functions with respect to the components of
$\zeta$, partial derivatives that we will evaluate at the corners of the unit hypercube.
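The following sketch illustrates the unary representation and the evaluation of a partial derivative at a vertex of the unit hypercube. The three symbols, the payoff table, and the linear extension of the utility are all hypothetical choices made for this example; other extensions off the vertices are possible.

```python
SYMBOLS = ["left", "straight", "right"]


def unary(symbol):
    """Encode a discrete symbol as a vertex of the unit hypercube."""
    return [1.0 if s == symbol else 0.0 for s in SYMBOLS]


# A hypothetical utility defined on symbols, one payoff per option ...
PAYOFF = {"left": 2.0, "straight": 5.0, "right": 1.0}


def utility(vector):
    """... extended linearly so that it is defined for any real vector."""
    return sum(PAYOFF[s] * x for s, x in zip(SYMBOLS, vector))


def partial_derivative(func, vector, index, eps=1e-6):
    """Central finite-difference estimate of d func / d vector[index]."""
    up = list(vector)
    down = list(vector)
    up[index] += eps
    down[index] -= eps
    return (func(up) - func(down)) / (2.0 * eps)


vertex = unary("straight")
gradient_at_vertex = [partial_derivative(utility, vertex, i)
                      for i in range(len(SYMBOLS))]
```

The intermediate points visited by the finite difference are not meaningful symbol values, but the extended utility is well-defined there, which is all the derivative requires.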

4) For notational convenience, we define $\zeta_{,t} \in Z_{,t}$ to be the vector of the states of
all nodes at time $t$; $\zeta_{\hat{\eta},t} \in Z_{\hat{\eta},t}$ to be the vector of the states of all nodes other than
$\eta$ at time $t$; and $\zeta \in Z$ to be the entire vector of the states of all nodes at all
times. $Z$ is infinite-dimensional in general, and usually assumed to be a Hilbert space.
Also for notational convenience, we define gradients using $\partial$-shorthand. So for example,
$\partial_{\zeta_{,t}} F(\zeta)$ is the vector of the partial derivatives of $F(\zeta)$ with respect to the components
of $\zeta_{,t}$. Finally, we will sometimes treat the symbol “$t$” specially, as delineating a range
of components of $\zeta$. So for example an expression like “$\zeta_{,t<t'}$” refers to all components
$\zeta_{,t}$ with $t < t'$.

5) We take the universe in which our COIN operates to be completely deterministic. 
This is certainly the case for any COIN that operates in a digital system, even a system 
that emulates analog and/or stochastic processes (e.g., with a pseudo-random number 
generator). More generally, this determinism reflects the fact that, since the real world obeys
(deterministic) physics, any real-world system, be it a COIN or something else, is,
ultimately, embedded in a deterministic system.$^7$

6) Formally, to reflect this determinism, first we bundle all variables we’re not directly
considering, but which nonetheless affect the dynamics of the system, as components
of some catch-all environment node. So for example any “noise processes” and the
like affecting the COIN’s dynamics are taken to be inputs from a deterministic, very
high-dimensional environment that is potentially chaotic and is never directly observed
[?]. Given such an environment node, we then stipulate that for all $t$ and all $t' > t$,
$\zeta_{,t}$ sets $\zeta_{,t'}$ uniquely.
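A toy rendering of this stipulation: below, the "noise" seen by an agent node is produced by an environment node that carries the state of a pseudo-random number generator, so the joint state at any time fixes all later states. The particular generator (a linear congruential recurrence), the agent's update rule, and the names `step` and `worldline` are assumptions made purely for illustration.

```python
def step(state):
    """One tick of a deterministic update: joint state at t -> state at t + 1."""
    agent, environment = state
    # The environment node carries the generator state; its output plays the
    # role of a "noise process" but is fully determined by the current state.
    environment = (1103515245 * environment + 12345) % 2 ** 31
    noise = environment % 3 - 1  # one of -1, 0, +1
    agent = agent + noise
    return (agent, environment)


def worldline(initial_state, n_steps):
    """The full sequence of joint states generated from one starting state."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


run_a = worldline((0, 42), 50)
run_b = worldline((0, 42), 50)

# Restarting from the joint state at t = 10 reproduces the rest of the run.
tail = worldline(run_a[10], 40)
```

An observer who sees only the agent component would describe its motion as a random walk; including the environment node in the state removes the randomness.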

7) We express the dynamics of our system by writing $\zeta_{,t'>t} = C(\zeta_{,t})$. (In this paper
there will be no need to be more precise and specify the precise dependency of $C(\cdot)$ on $t$
and/or $t'$.) We define $\{C\}$ to be a set of constraint equations enforcing that dynamics,
and also, more generally, fixing the entire manifold $C$ of vectors $\zeta \in Z$ that we consider
to be ‘allowed’. So $C$ is a subset of the set of all $\zeta \in Z$ that are consistent with
the deterministic laws governing the COIN, i.e., that obey $\zeta_{,t'>t} = C(\zeta_{,t}) \;\; \forall\, t, t'$. We
generalize this notation in the obvious way, so that (for example) $C_{,t>t_0}$ is the manifold
consisting of all vectors $\zeta_{,t>t_0} \in Z_{,t>t_0}$ that are projections of a vector in $C$.

Discussion: Note that $C_{,t>t_0}$ is parameterized by $\zeta_{,t_0}$, due to determinism. Note also
that whereas $C(\cdot)$ is defined for any argument of the form $\zeta_{,t} \in Z_{,t}$ for some $t$ (i.e., we
can evolve any point forward in time), in general not all $\zeta_{,t} \in Z_{,t}$ lie in $C_{,t}$. In particular,
there may be extra restrictions constraining the possible states of the system beyond
those arising from its need to obey the relevant dynamical laws of physics.

7 This determinism holds even for systems with an explicitly quantum mechanical character. Quantum
mechanical systems evolve according to Schrödinger’s equation, which is purely deterministic; as is now
well-accepted, the “stochastic” aspect of quantum mechanics can be interpreted as an epiphenomenon of
Schrödinger’s equation that arises when the Hamiltonian has an “observational” or “entangling” coupling
between some of its variables [?, ?, ?], a coupling that does not obviate the underlying determinism.

Discussion: We do not want to have Z be the phase space of every particle in the 
system. We will instead usually have Z consist of variables that, although still evolving 
deterministically, exist at a larger scale of granularity than that of individual particles 
(e.g., thermodynamic variables in the thermodynamic limit). However we will often 
be concerned with physical systems obeying entropy-driven dynamic processes that are 
contractive at this high level of granularity. Examples are any of the many-to-one map- 
pings that can occur in digital computers, and, at a finer level of granularity, any of the 
error-correcting processes in the electronics of such a computer that allow it to operate 
in a digital fashion. Accordingly, although the dynamics of our system will always be 
deterministic, it need not be invertible. 

Discussion: Intuitively, in our mathematics, all behavior across time is pre-fixed. The 
COIN is a single fixed worldline through Z, with no “unfolding of the future” as the 
dice underlying a stochastic dynamics get cast. This is consistent with the fact that we
want the formalism to be purely descriptive, relating different properties of any single, 
fixed COIN’s history. We will often informally refer to “changing a node’s state at 
a particular time”, or to a microlearner’s “choosing from a set of options”, and the 
like. Formally, in all such phrases we are really comparing different worldlines, with the 
indicated modification distinguishing those worldlines. 

Discussion: Since the dynamics of any real-world COIN is deterministic, so is the dynamics
of any component of the COIN, and in particular so is any learning algorithm
running in the COIN, ultimately. However that does not mean that those determin- 
istic components of the COIN are not allowed to be “based on”, or “motivated by” 
probability-based concepts. The motivation behind the algorithms run by the compo- 
nents of the COIN does not change their underlying nature. Indeed, in our experiments 
below, we explicitly have the reinforcement learning algorithms that are trying to max- 
imize private utility operate in a (pseudo-) probabilistic fashion, with pseudo-random 
number generators and the like. 

More generally, the deterministic nature of our framework does not preclude our superimposing
probabilistic elements on top of that framework, and thereby generating a
stochastic extension of our framework. Exactly as in statistical physics, a stochastic na- 
ture can be superimposed on top of our space of deterministic worldlines. Formally, this 
is what is done in conventional time-series analysis (or for that matter all of conventional 
statistics), where the superimposing of a probability distribution across a space of possi- 
ble histories of the universe in no way violates the physical fact that each of the histories 
taken individually obeys the (deterministic) laws of physics. Indeed, the “macrolearning” 
algorithms we investigate below implicitly involve such a superimposing; they implicitly
assume a probabilistic coupling between the (statistical estimate of the) correlation coefficient
connecting the states of a pair of nodes and whether those nodes are in one
another’s “effect set”. In this paper though, we concentrate on the mathematics that
obtains before such probabilistic concerns are superimposed. Whereas the deterministic 
analysis presented here is related to game-theoretic structures like Nash equilibria, a 
full-blown stochastic extension would in some ways be more related to structures like 
correlated equilibria [9]. 

8) Formally, there is a lot of freedom in setting the boundary between what we call 
“the COIN”, whose dynamics is determined by C, and what we call “macrolearning”, 
which constitutes perturbations to the COIN instigated from “outside the COIN”, and 
which therefore is not reflected in C. As an example, in much of this paper, we have 
clearly specified microlearners which are provided fixed private utility functions that they 
are trying to maximize. In such cases usually we will implicitly take C to be the dynamics 
of the system, microlearning and all, for fixed private utilities that are specified in $\zeta$. For
example, $\zeta$ could contain, for each microlearner, the bits in an associated computer
specifying the subroutine that that microlearner can call to evaluate what its private
utility would be for some full worldline $\zeta$.

Macrolearning overrides C, and in this situation it refers (for example) to any statis- 
tical inference process that modifies the private utilities at run-time to (try to) induce 
the desired salient characteristics. Concretely, in the preceding example, macrolearning 
could involve modifications to the bits $b_i$ specifying each microlearner $i$’s private utility,
modifications that are not accounted for in C, and that are potentially based on variables 
that are not reflected in Z. Since C does not reflect such macrolearning, when trying 
to ascertain C based on empirical observation (as for example when determining how 
best to modify the private utilities), we have to take care to distinguish which part of 
the system’s observed dynamics is due to C and which part instead reflects externally 
imposed modifications to the private utilities. 

More generally though, other boundaries between the COIN and macrolearning-based 
perturbations to it are possible, reflecting other definitions of $Z$, and other interpretations
of the elements of each $\zeta \in Z$. For example, it may be that in addition to the
dynamics of other bits, $C$ also encapsulates the dynamics of the bits $b_i$. In this case, we
could view each private utility as still being fixed, but rather than take the bits $b_i$ as
“encoding” the private utility of microlearner $i$, we would treat them as “parameters”
of that (fixed) private utility. In other words, formally, they constitute an extra set
of arguments to $i$’s private utility. Alternatively, we could simply say that our private
utilities are time-indexed, with $i$’s private utility at time $t$ determined by $b_{i,t}$, which in
turn is determined by evolution under $C$. Under either interpretation of private utility,
any modification to those bits under C constitutes dynamical laws by which the pa- 
rameters of the microlearners evolve in time. In this case, macrolearning would refer to 
some further removed process that modifies the evolution of the system in a way not 
encapsulated in C . 

For such alternative definitions of C/Z, we have a different boundary between the
COIN and macrolearning, and we must scrutinize different aspects of the COIN’s dynamics
to infer $C$. Whatever the boundary, the mathematics of the descriptive framework,
including the mathematics concerning the salient characteristics, is restricted to a system
evolving according to $C$, and explicitly does not account for macrolearning. This is why
the strategy of trying to improve world utility by using macrolearning to try to induce 
salient characteristics is ultimately based on an assumption rather than a proof. 

9) We are provided with some Von Neumann world utility $G : Z \to \mathbb{R}$ that ranks
the various conceivable worldlines of the COIN. Note that since the environment node
is never directly observed, we implicitly assume that the world utility is not directly
(!) a function of its state. Our mathematics will not involve $G$ alone, but rather
the relationship between $G$ and various sets of personal utilities $g_{\eta,t} : Z \times Z \to \mathbb{R}$.
Intuitively, as discussed below, for many purposes such personal utilities are equivalent 
to the private utilities mentioned above. 

Discussion: These utility definitions are very broad. In particular, they do not require 
casting of the utilities as discounted sums. Note also that our world utility is not indexed 
by t. Again reflecting the descriptive, worldline character of the formalism, we simply 
assign a single value to an entire worldline of the system, implicitly assuming that one 
can always say which of two candidate worldlines is preferable. So given some “present
time” $t_0$, issues like which of two “potential futures” $\zeta_{,t>t_0}$, $\zeta'_{,t>t_0}$ is preferable are resolved
by evaluating the relevant utility at two associated points $\zeta$ and $\zeta'$, where the $t > t_0$
components of those points are the futures indicated, and the two points share the same
(usually implicit) $t < t_0$ “past” components.

This time-independence of G automatically avoids formal problems that can occur
with general (i.e., not necessarily discounted sum) time-indexed utilities, problems like
having what’s optimal at one moment in time conflict with what’s optimal at other
moments in time (see footnote 8). For personal utilities such formal problems are often
irrelevant however. We as COIN designers must be able to rank all possible worldlines of
the system to have a well-defined design task. However if a particular microlearner’s goal
keeps changing in an inconsistent way, that simply means that that microlearner will grow
“confused”. From our perspective as COIN designers, there is nothing a priori unacceptable
about such confusion. It may even result in better performance of the system as a whole, in
which case we would actually want to induce it. Nonetheless, for simplicity, in most of
this paper we will have all $g_{\eta,t}$ be independent of t, just like world utility.

8 Such conflicts can be especially troublesome when they interfere with our defining what we mean by
an “optimal” set of actions by the nodes at a particular time t. Their ability to interfere in this way is
due to the fact that the effects of the actions by the nodes depend on the future actions of the nodes.
However if they too are to be optimal, those future actions will depend on their futures. So we have a
potentially inconsistent infinite regress of stipulations of what “optimal” actions entail.

World utility is defined as that function that we are ultimately interested in optimizing.
In conventional RL it is a discounted sum, with the sum starting at time t. In
other words, conventional RL has a time-indexed world utility. It might seem that in
this at least, conventional RL considers a case that has more generality than that of
the COIN framework presented here. (It obviously has less generality in that its world
utility is restricted to be a discounted sum.) In fact though, the apparent time-indexing
of conventional RL is illusory, and the time-dependent discounted sum world utility of
conventional RL is actually a special case of the non-time-indexed world utility of the
COIN framework. To see this formally, consider any (time-independent) world utility
$G(\zeta)$ that equals $\sum_{t=0}^{\infty} \gamma^t r(\zeta_{,t})$ for some function r(.) and some
positive constant $\gamma$ with magnitude less than 1. Then for any $t' > 0$ and any $\zeta'$
and $\zeta''$ where $\zeta'_{,t<t'} = \zeta''_{,t<t'}$,

$$\mathrm{sgn}[G(\zeta') - G(\zeta'')] = \mathrm{sgn}\Big[\sum_{t=t'}^{\infty} \gamma^t r(\zeta'_{,t}) - \sum_{t=t'}^{\infty} \gamma^t r(\zeta''_{,t})\Big].$$

Conventional RL merely expresses this in terms of time-dependent utilities
$u_{t'}(\zeta_{,t \ge t'}) \equiv \sum_{t=t'}^{\infty} \gamma^{t-t'} r(\zeta_{,t})$ by writing
$\mathrm{sgn}[G(\zeta') - G(\zeta'')] = \mathrm{sgn}[u_{t'}(\zeta') - u_{t'}(\zeta'')]$ for all $t'$.
Since utility functions are, by definition, only unique up to the relative orderings they
impose on potential values of their arguments, we see that conventional RL’s use of a
time-dependent discounted sum world utility $u_{t'}$ is identical to use of a particular
time-independent world utility in the COIN framework.
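The equivalence of the two orderings can be checked numerically. The sketch below is only an illustration under stated assumptions: worldlines are reduced to finite lists of per-step rewards $r(\zeta_{,t})$ (the text uses infinite sums), and the reward values, `gamma`, and all function names are hypothetical choices made for the example.

```python
def G(rewards, gamma):
    """Time-independent world utility: sum over t >= 0 of gamma**t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def u(rewards, gamma, t_prime):
    """Time-indexed utility: sum over t >= t_prime of gamma**(t - t_prime) * r_t."""
    return sum(gamma ** (t - t_prime) * r
               for t, r in enumerate(rewards) if t >= t_prime)


def sgn(x):
    return (x > 0) - (x < 0)


gamma = 0.9
shared_past = [1.0, 0.5]        # rewards for t < t_prime, common to both worldlines
future_a = [0.0, 2.0, 1.0]      # one candidate future (hypothetical values)
future_b = [1.0, 0.0, 3.0]      # another candidate future (hypothetical values)
t_prime = len(shared_past)

zeta_a = shared_past + future_a
zeta_b = shared_past + future_b

# Both utilities rank the two worldlines the same way.
order_by_G = sgn(G(zeta_a, gamma) - G(zeta_b, gamma))
order_by_u = sgn(u(zeta_a, gamma, t_prime) - u(zeta_b, gamma, t_prime))
print(order_by_G, order_by_u)   # -1 -1
```

The agreement is not specific to these numbers: with a shared past, the difference in G is the difference in $u_{t'}$ multiplied by the positive factor $\gamma^{t'}$.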

10) As mentioned above, there may be variables in each node’s state which, under
one particular interpretation, represent the “utility functions” that the associated
microlearner’s computer program is trying to extremize. When there are such components
of $\zeta$, we refer to the utilities they represent as private utilities. However even when
there are private utilities, formally we allow the personal utilities to differ from them.
The personal utility functions $\{g_\eta\}$ do not exist “inside the COIN”; they are not specified
by components of $\zeta$. This separating of the private utilities from the $\{g_\eta\}$ will allow us
to avoid the teleological problem that one may not always be able to explicitly identify
“the” private utility function reflected in $\zeta$ such that a particular computational device
can be said to be a microlearner “trying to increase the value of its private utility”. To
the degree that we can couch the theorems purely in terms of personal rather than
private utilities, we will have successfully adopted a purely behaviorist approach, without
any need to interpret what a computational device is “trying to do”.

Despite this formal distinction though, often we will implicitly have in mind deploying 
the personal utilities onto the microlearners as their private utilities, in which case the 
terms can usually be used interchangeably. The context should make it clear when this 
is the case. 

4.2.2 Intelligence 

We will need a measure of the performance of an arbitrary worldline $\zeta$ for an arbitrary
utility function under arbitrary dynamic laws C. Such a measure is a mapping from
three arguments to $\mathbb{R}$. Having such a measure will allow us to quantify how well the
entire system performs in terms of G. It will also allow us to quantify how well each
microlearner performs in purely behavioral terms, in terms of its personal utility. (In
our behaviorist approach, we do not try to make specious distinctions between whether
a microlearner performs well due to its “innate sophistication”, or rather “by sheer luck”;
all that matters is how effective its behavior is.) This behaviorism in turn will allow
us to avoid having private utilities explicitly arise in our theorems (although they still
arise frequently in pedagogical discussion). Even when private utilities exist, there will
be no formal need to explicitly identify some components of $\zeta$ as such utilities. Assuming
a node’s microlearner is competent, the fact that it is trying to optimize some particular
private utility U will be manifested in our performance measure’s having a large value
at $\zeta$ for C for that utility U.

The problem of how to formally define such a performance measure is essentially 
equivalent to the problem of how to quantify bounded rationality in game theory. Some 
of the relevant work in game theory is concerned with refinements of equilibria, and 
adopts a strongly teleological perspective on rationality [?]. In general, such work is
only narrowly applicable, to those situations where the rationality is bounded due to the 
precise causal mechanisms investigated in that work. Most of the other game-theoretic 
work first models (!) the microlearner as some extremely simple computational device
(e.g., a deterministic finite automaton (DFA)). One then assumes that the microlearner 
performs perfectly for that device, so that one can measure that learner’s performance 
in terms of some computational capacity measure of the model (e.g., for a DFA, the 
number of states of that DFA) [89, 178, 203]. However if taken as renditions of real- 
world computer-based microlearners (never mind human microlearners!), the models in 
this approach are often extremely abstracted, with many important characteristics of the 
real learners absent or distorted. In addition, there is little reason to believe that any 
results arising from this approach would not be highly dependent on the model choice and 
on the associated representation of computational capacity. Yet another disadvantage is 
that this approach concentrates on perfect, fully rational behavior of the microlearners. 

We would prefer a less model-dependent approach, one based solely on the utility
function at hand, $\zeta$, and C. Now we don’t want our performance measure to be a
“raw” utility value like $g_\eta(\zeta)$, since that is not invariant with respect to monotonic
transformations of $g_\eta$. Similarly, we don’t want to penalize the microlearner for not
achieving a certain utility value if that value was impossible to achieve due to C and
the actions of other nodes. A natural way to address these concerns is to generalize
the game-theoretic concept of “best-response strategy” and consider the problem of how
well $\eta$ performs given the actions of the other nodes. Such a measure would compare the
possible states of $\eta$ at some particular time, which without loss of generality we can take
to be 0, to the actual state $\zeta_{\eta,0}$. In other words, we would compare the utility of the
actual worldline $\zeta$ to those of a set of alternative worldlines $\zeta'$ where
$\zeta_{\hat{\eta},0} = \zeta'_{\hat{\eta},0}$, and use those comparisons to quantify the quality of
$\eta$’s performance.

Now we’re only concerned with comparing the effects of replacing $\zeta$ with $\zeta'$ on future
contributions to the utility. But if we allow arbitrary $\zeta'_{,t<0}$, then in and of themselves
the differences between those past components of $\zeta'$ and those of $\zeta$ can modify the value
of the utility, regardless of the effects of any difference in the future components. Our
presumption is that for all COINs of interest we can avoid this conundrum by restricting
attention to those $\zeta'$ where $\zeta'_{,t<0}$ differs from $\zeta_{,t<0}$ only in the internal
parameters of $\eta$’s microlearner, differences that only at times $t \ge 0$ manifest themselves
in a form the utility is concerned with. (In game-theoretic terms, such “internal parameters”
encode full extensive form strategies, and we only consider changes to the vertices at or
below the t = 0 level in the tree of an extensive-form strategy.) Under this presumption,
without violating C, we’re able to pose the question, “if we change the state of $\eta$ at time 0
in such-and-such a way, leaving everything else of interest at that time unchanged, what
are the ramifications on the utility?”

However we don’t want to restrict the computational algorithms that can run on
a node to those that have a clearly pre-specified set of “internal parameters” and the
like. So instead, we formalize our presumption behaviorally. Since changing the internal
parameters doesn’t affect the t < 0 components of $\zeta$ that the utility is concerned with,
and since we are only concerned with changes to $\zeta$ that affect the utility, we simply elect
to not change the t < 0 values of the internal parameters of $\zeta_\eta$ at all. In other words,
we leave $\zeta_{,t<0}$ unchanged, which is something we can do just as easily whether $\eta$ does
or doesn’t have any “internal parameters” in the first place.

So in quantifying the performance of $\eta$ for behavior given by $\zeta$ we compare $\zeta$ to a
set of $\zeta'$, a set restricted to those $\zeta'$ sharing $\zeta$’s past: $\zeta'_{,t<0} = \zeta_{,t<0}$,
$\zeta'_{\hat{\eta},0} = \zeta_{\hat{\eta},0}$, and $\zeta'_{,t \ge 0} \in C_{,t \ge 0}$. Since $\zeta'_{\eta,0}$ is
free to vary (reflecting the possible changes in the state of $\eta$ at time 0), $\zeta' \notin C$,
in general, and we may even wish to allow $\zeta'_{,t \ge 0} \notin C_{,t \ge 0}$ in
certain circumstances. (Recall that C may reflect other restrictions imposed on allowed
worldlines besides adherence to the underlying dynamical laws, so simply obeying those
laws does not force a worldline to lie on C.) However our presumption is that as far
as utility values are concerned, considering such $\zeta'$ is equivalent to considering a more
restricted set of $\zeta'$ with “modified internal parameters”, all of which are $\in C$.

We now present a formalization of this performance measure. Given C and a measure
$d\mu(\zeta_{,0})$ demarcating what points in $Z_{,0}$ we’re interested in, we define the (t = 0)
intelligence for node $\eta$ of a point $\zeta$ with respect to a utility U as follows:

$$\epsilon_{\eta,U}(\zeta) \equiv \int d\mu(\zeta'_{,0})\, \Theta\big[U(\zeta) - U(\zeta_{,t<0}, C(\zeta'_{,0}))\big] \times \delta(\zeta'_{\hat{\eta},0} - \zeta_{\hat{\eta},0}) \qquad (1)$$

where $\Theta(.)$ is the Heaviside theta function which equals 0 if its argument is below 0 and
equals 1 otherwise, $\delta(.)$ is the Dirac delta function, and we assume that
$\int d\mu(\zeta'_{,0})\, \delta(\zeta'_{\hat{\eta},0} - \zeta_{\hat{\eta},0}) = 1$.
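For a finite system the integral in the definition of intelligence reduces to a count. The following is a minimal sketch under assumptions that are not in the text: each node’s t = 0 state takes one of a few discrete values, the measure is uniform over node $\eta$’s alternatives, and the utility is given directly as a function of the t = 0 joint state (so the dynamics C are folded into it). The names `intelligence` and `toy_utility` are hypothetical.

```python
def intelligence(utility, state, node, alternatives):
    """Fraction of alternative t=0 states of `node` that do no better than its actual state.

    Assumes a uniform measure over `alternatives`; all other nodes are held fixed,
    which plays the role of the delta function in the definition.
    """
    actual_value = utility(state)
    no_better = 0
    for alt in alternatives:
        modified = dict(state)
        modified[node] = alt                      # change only this node's state
        # Heaviside term: contributes 1 when U(actual) - U(alternative) >= 0
        if actual_value - utility(modified) >= 0:
            no_better += 1
    return no_better / len(alternatives)


def toy_utility(state):
    # Hypothetical utility: node "a" does best at 3; node "b" adds a fixed offset.
    return -(state["a"] - 3) ** 2 + state["b"]


alternatives_for_a = [0, 1, 2, 3, 4]
print(intelligence(toy_utility, {"a": 3, "b": 7}, "a", alternatives_for_a))  # 1.0
print(intelligence(toy_utility, {"a": 1, "b": 7}, "a", alternatives_for_a))  # 0.4
```

Note that adding any constant to `toy_utility`, or applying any increasing transformation to it, leaves both printed values unchanged, since only the ordering of utility values enters the count.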

Intuitively, $\epsilon_{\eta,U}(\zeta)$ measures the fraction of alternative states of $\eta$ which, if
$\eta$ had been in those states at time 0, would either degrade or not improve $\eta$’s performance
(as measured by U). As an example, conventional full rationality game theory involving
Nash equilibria is exclusively concerned with scenarios in which all such fractions equal
1 (see footnote 9). More generally, competent greedy pursuit of private utility U by the
microlearner controlling node $\eta$ means that the intelligence of $\eta$ for personal utility U,
$\epsilon_{\eta,U}(\zeta)$, is close to 1. Accordingly, we will often refer interchangeably to a capable
microlearner’s “pursuing private utility U”, and to its having high intelligence for personal
utility U. Alternatively, if the microlearner for node $\eta$ is incompetent, then it may even be
that “by luck” its intelligence for some personal utility $g_\eta$ exceeds its intelligence for the
different private utility that it’s actually trying to maximize, $U_\eta$.

9 As an alternative to such fully rational games, one can define a bounded rational game as one in
which the intelligences equal some vector $\vec{\epsilon}$ whose components need not all equal 1. Many of the theorems

Any two utility functions that are related by a monotonically increasing transformation
reflect the same preference ordering over the possible arguments of those functions.
Since it is only that ordering that we are ever concerned with, we would like to remove
this degeneracy by “normalizing” all utility functions. To see what this means in the
COIN context, fix $\zeta_{\hat{\eta},0}$. Viewed as a function from $Z_{\eta,0} \to \mathbb{R}$,
$\epsilon_{\eta,U}(\zeta_{\hat{\eta},0}, \cdot)$ is itself a utility function. It says how well $\eta$ would
have performed for all points $\zeta_{\eta,0}$. Accordingly, the integral transform taking U to
$\epsilon_{\eta,U}(\zeta_{\hat{\eta},0}, \cdot)$ is a (contractive, non-invertible) mapping from
utilities to utilities. It can be proven that any mapping from utilities to utilities that
meets certain simple desiderata must be such an integral transform. (An example of such
a desideratum is that the mapping has the same output utility for any two input utilities
that are monotonically increasing transforms of one another.) In this sense, intelligence is
the unique way of “normalizing” Von Neumann utility functions.

For those conversant with game theory, it is worth noting some of the interesting
aspects that ensue from this normalizing nature of intelligences. At any point $\zeta$ that is a
Nash equilibrium in the personal utilities $\{g_\eta\}$, all intelligences $\{\epsilon_{\eta,g_\eta}(\zeta)\}$
must equal 1. Since that is the maximal value any intelligence can take on, a Nash
equilibrium in the $\{g_\eta\}$ is a Pareto optimal point in the associated intelligences (for the
simple reason that no deviation from such a $\zeta$ can raise any of the intelligences). Now
restrict attention to systems with only a single instant of time (i.e., single-stage games),
and have each of the (real-valued) components of each $\zeta_\eta$ be a mixing component of an
associated one of $\eta$’s potential strategies for some underlying game, with $g_\eta(\zeta)$ being
the associated expected payoff to $\eta$. (So the payoff to $\eta$ of the underlying pure strategies
is given by the values of $g_\eta(\zeta)$ when $\zeta_\eta$ is a unit vector in the space of $\eta$’s
possible states.) Then we know that there must exist at least one Nash equilibrium in the
$\{g_\eta\}$. In turn, whenever we are assured of a Nash equilibrium in the $\{g_\eta\}$, the set of
such equilibria is identical to the set of points that are Pareto optimal in the associated
intelligences. (See Eq. 5 in the discussion of factored systems below.)

of conventional game theory can be directly carried over to apply to such bounded-rational games [251] 
by redefining the utility functions of the players. I.e., much of conventional full rationality game theory 
applies even to games with bounded rationality, under the appropriate transformation. This potentially 
has major implications for the common criticism of modern economic theory that its full rationality 
assumption does not hold in the real world.

4.2.3 Learnability


Intelligence can be a difficult quantity to work with, unfortunately. As an example, fix
$\eta$, and consider any (small region centered about some) $\zeta$ that is not a local maximum of
some utility U. Then by increasing the values of U evaluated in that small region we will
increase the intelligence $\epsilon_{\eta,U}(\zeta)$. However in doing this we will also necessarily
decrease the intelligence at points outside that region. So intelligence has a non-local
character, a character that prevents us from directly modifying it to ensure that it is
simultaneously high for any and all $\zeta$.

A second, more general problem is that without specifying the details of a mi- 
crolearner, it can be extremely difficult to predict which of two private utilities the 
microlearner will be better able to learn. (Indeed, even with the details, making that 
prediction can be nearly impossible.) So it can be extremely difficult to determine what 
private utility intelligence values will accrue to various choices of those private utilities. 
In other words, macrolearning that involves modifying the private utilities to try to
directly increase intelligence with respect to those utilities can be quite difficult.

Fortunately we can circumvent many of these difficulties by using a proxy for (private 
utility) intelligence. Although we expect its value usually to be correlated with that of 
intelligence in practice, this proxy does not share intelligence’s non-local nature. In ad- 
dition, the proxy does not depend heavily on the details of the microlearning algorithms 
used, i.e., it is fairly independent of those aspects of C. 

We motivate this proxy by considering having $g_\eta = G$ for all $\eta$. If we try to actually
use these $\{g_\eta\}$ as the microlearners’ private utilities, particularly if the COIN is large, we
will invariably encounter a very bad signal-to-noise problem. For this choice of utilities,
the effects of the actions taken by node $\eta$ on its utility may be “swamped” and effectively
invisible, since there are so many other processes going into determining G’s value. In
such a scenario, there is nothing that $\eta$’s microlearner can do to reliably achieve high
intelligence (see footnote 10).

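One natural way to quantify this effect is as (utility) learnability: Given a measure
$d\mu(\zeta_{,0})$ and manifold C, the utility learnability of a utility U for a node $\eta$ at $\zeta$ is:

$$\Lambda_{\eta,U}(\zeta) \equiv \frac{\int d\mu(\zeta'_{,0})\, \big|U(\zeta_{,t<0}, C(\zeta_{\hat{\eta},0}, \zeta'_{\eta,0})) - U(\zeta)\big|}{\int d\mu(\zeta'_{,0})\, \big|U(\zeta_{,t<0}, C(\zeta'_{\hat{\eta},0}, \zeta_{\eta,0})) - U(\zeta)\big|} \qquad (2)$$

(Intelligence learnability is defined the same way, with U(.) replaced by
$\epsilon_{\eta,U}(.)$.) Note that scaling all utility values by the same overall factor does not
affect the value of the learnability.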

The integrand in the numerator of the definition of learnability reflects how much
of the change in U that results from replacing $\zeta_{,0}$ with $\zeta'_{,0}$ is due to the change in
$\eta$’s t = 0 state (the “signal”). The denominator reflects how much of the change in U that
results from replacing $\zeta$ with $\zeta'$ is due to the change in the t = 0 states of nodes other
than $\eta$ (the “noise”). So learnability quantifies how easy it is for the microlearner to
discern the “echo” of its behavior in the utility function U. Our presumption is that the
microlearning algorithm will achieve higher intelligence if provided with a more learnable
private utility.

10 This “signal-to-noise” problem is actually endemic to reinforcement learning as a whole, even
sometimes occurring when one has just a single reinforcement learner, and only a few random variables
jointly determining the value of the rewards [254].

Note that a particular value of utility learnability, by itself, has no significance. Simply
rescaling the units of $\zeta_{\eta,0}$ will change that value. Rather what is important is the ratio
of differential learnabilities, at the same $\zeta$, for different U’s. Such a ratio quantifies the
relative preferability of those U’s.
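The signal-to-noise reading of learnability, and the comparison of different utilities at the same point, can be illustrated on a small discrete system. This sketch uses the integral form of learnability rather than the differential one, and it rests on assumptions that are not in the text: binary t = 0 node states, a uniform measure, utilities written directly as functions of the t = 0 joint state, and the hypothetical names `learnability`, `team_utility`, `own_utility` and `make_system`.

```python
from itertools import product


def learnability(utility, state, node, domains):
    """Mean |change in U| from varying `node` alone (signal), divided by
    mean |change in U| from varying all the other nodes (noise)."""
    base = utility(state)
    others = [n for n in domains if n != node]

    signal_terms = []
    for alt in domains[node]:
        modified = dict(state)
        modified[node] = alt
        signal_terms.append(abs(utility(modified) - base))

    noise_terms = []
    for combo in product(*(domains[n] for n in others)):
        modified = dict(state)
        modified.update(zip(others, combo))
        noise_terms.append(abs(utility(modified) - base))

    signal = sum(signal_terms) / len(signal_terms)
    noise = sum(noise_terms) / len(noise_terms)
    return float("inf") if noise == 0 else signal / noise


def team_utility(state):
    return sum(state.values())      # every node is handed the world utility


def own_utility(state):
    return state["n0"]              # depends on node n0's own state only


def make_system(n_nodes):
    domains = {f"n{i}": [0, 1] for i in range(n_nodes)}
    state = {name: 0 for name in domains}
    return state, domains


small_state, small_domains = make_system(4)
large_state, large_domains = make_system(11)
print(learnability(team_utility, small_state, "n0", small_domains))   # about 1/3
print(learnability(team_utility, large_state, "n0", large_domains))   # about 1/10
print(learnability(own_utility, large_state, "n0", large_domains))    # inf
```

In this toy setting the team utility becomes harder to learn as nodes are added, while a utility that depends only on the node’s own state has no noise term at all.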


More generally, learnability is not meant to capture all factors that will affect how 
high an intelligence value a particular microlearner will achieve. This is not possible if 
for no other reason than the fact that there are many such factors that are idiosyncratic
to the microlearner used. In addition though, certain more general factors affecting 
learning, like the curse of dimensionality, are not explicitly designed into learnability. 
Learnability is not meant to quantify performance — that is what intelligence is designed 
to do. Rather (relative) learnability is meant to provide a guide for how to improve 
performance. 


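The (utility) differential learnability at a point $\zeta$ is the learnability with $d\mu$
restricted to an infinitesimal ball about $\zeta$. We formalize it as the following ratio of
magnitudes of gradients:

$$\lambda_{\eta,U}(\zeta) \equiv \frac{\big\|\nabla_{\zeta_{\eta,0}} U(\zeta_{,t<0}, C(\zeta_{,0}))\big\|}{\big\|\nabla_{\zeta_{\hat{\eta},0}} U(\zeta_{,t<0}, C(\zeta_{,0}))\big\|} \qquad (3)$$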
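The gradient ratio can be estimated numerically. The following is a rough sketch, assuming (beyond what the text states) that each node’s t = 0 state is a single real number, that the utility is supplied directly as a function of the list of t = 0 states, and that central finite differences are an acceptable stand-in for the gradients. The function names and test utilities are hypothetical.

```python
import math


def gradient(utility, point, indices, step=1e-6):
    """Central finite-difference gradient of `utility` along the given coordinates."""
    grad = []
    for i in indices:
        up = list(point)
        up[i] += step
        down = list(point)
        down[i] -= step
        grad.append((utility(up) - utility(down)) / (2 * step))
    return grad


def differential_learnability(utility, point, node_index):
    """Norm of the gradient in the node's own state over the norm in all other states."""
    own = gradient(utility, point, [node_index])
    rest = gradient(utility, point,
                    [i for i in range(len(point)) if i != node_index])
    own_norm = math.sqrt(sum(g * g for g in own))
    rest_norm = math.sqrt(sum(g * g for g in rest))
    return float("inf") if rest_norm == 0 else own_norm / rest_norm


def team_utility(point):
    return sum(point)               # world utility shared by every node


def own_utility(point):
    return point[0] ** 2            # depends on node 0's state only


point = [0.3, 0.1, 0.4, 0.2, 0.5]   # five nodes, hypothetical t=0 states
print(differential_learnability(team_utility, point, 0))   # close to 1/sqrt(4) = 0.5
print(differential_learnability(own_utility, point, 0))    # inf: no dependence on others
```

As with the integral form, only comparisons between utilities at the same point are meaningful here; the individual numbers change if the units of the states are rescaled.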


One nice feature of differential learnability is that unlike learnability, it does not
depend on choice of some measure $d\mu(.)$. This independence can lead to troubles if
one is not careful however, and in particular if one uses learnability for purposes other than
choosing between utility functions. For example, in some situations, the COIN designer
will have the option of enlarging the set of variables from the rest of the COIN that are
“input” to some node $\eta$ and that therefore can be used by $\eta$ to decide what action to take.
Intuitively, doing so will not affect the RL “signal” for $\eta$’s microlearner (the magnitude
of the potential “echo” of $\eta$’s actions is not modified by changing some aspect of how
it chooses among those actions). However it will reduce the “noise”. In the full integral
version of learnability, this effect can be captured by shrinking the support of $d\mu(.)$ to
reflect the fact that the extra inputs to $\eta$ at t = 0 are correlated with the t = 0 state
of the external system. In differential learnability however this is not possible, precisely
because no measure $d\mu(.)$ occurs in differential learnability. So we must capture the
reduction in noise in some other fashion (see footnote 11).

11 An example of how to do so is to replace $\nabla_{\zeta_{\hat{\eta},0}} U(\zeta_{,t<0}, C(\zeta_{,0}))$ in the
definition of differential learnability with the projection of that gradient onto the tangent
plane of $C_{,t \ge 0}$ at $\zeta$. Assume that in addition to the restriction of obeying the dynamic
laws C(.) for evolution to times past t = 0, the

A system that has infinite (differential, intelligence) learnability is said to be
“perfectly” (differential, intelligence) learnable. It is straightforward to prove that a system
is perfectly learnable $\forall \zeta \in C$ iff $\forall \eta$, $g_\eta(\zeta)$ can be written as
$\psi_\eta(\zeta_{\eta,0})$ for some function $\psi_\eta(.)$. (See the discussion below on the general
condition for a system’s being perfectly factored.)

4.3 A descriptive framework for COINs 

With these definitions in hand, we can now present (a portion of) one descriptive frame- 
work for COINs. In this subsection, after discussing salient characteristics in general, 
we present some theorems concerning the relationship between personal utilities and the 
salient characteristic we choose to concentrate on. We then discuss how to use those
theorems to induce that salient characteristic. 

4.3.1 Candidate salient characteristics of a COIN 

The starting point with a descriptive framework is the identification of “salient charac- 
teristics of a COIN which one strongly expects to be associated with its having large 
world utility”. In this chapter we will focus on salient characteristics that concern the re- 
lationship between personal and world utilities. These characteristics are formalizations 
of the intuition that we want COINs in which the competent greedy pursuit of their pri- 
vate utilities by the microlearners results in large world utility, without any bottlenecks, 
TOC, “frustration” (in the spin glass sense) or the like. 

One natural candidate for such a characteristic, related to Pareto optimality [90, 89],
is weak triviality. It is defined by considering any two worldlines $\zeta$ and $\zeta'$ both of
which are consistent with the system’s dynamics (i.e., both of which lie on C), where for
every node $\eta$, $g_\eta(\zeta) \ge g_\eta(\zeta')$. (An obvious variant is to restrict
$\zeta'_{,t<0} = \zeta_{,t<0}$, and require only that both of the “partial vectors” $\zeta'_{,t \ge 0}$ and
$\zeta_{,t \ge 0}$ obey the relevant dynamical laws, and therefore lie in $C_{,t \ge 0}$.) If for any such
pair of worldlines it is necessarily true that $G(\zeta) \ge G(\zeta')$, we say that the system is
weakly trivial. We might expect that systems that are weakly trivial for the microlearners’
private utilities are configured correctly for inducing large world utility. After all, for such
systems, if the microlearners collectively change $\zeta$ in a way that ends up helping all of
them, then necessarily the world utility also rises.

manifold $C_{,t \ge 0}$ reflects the restriction on $\zeta_{,t \ge 0}$ that the extra inputs to $\eta$ at t = 0 are
correlated with the t = 0 state of the external system. Under these circumstances, this projection of the
gradient of the $\hat{\eta}$ components will reduce the noise term in the appropriate fashion. See [250].

As it turns out though, weakly trivial systems can readily evolve to a world utility
minimum, one that often involves TOC. To see this, consider automobile traffic in the
absence of any traffic control system. Let each node be a different driver, and say their
private utilities are how quickly they each individually get to their destination. Identify
world utility as the sum of private utilities. Then by simple additivity, for all $\zeta$ and
$\zeta'$, whether they lie on C or not, if $g_\eta(\zeta) \ge g_\eta(\zeta')$ for all $\eta$ it follows that
$G(\zeta) \ge G(\zeta')$; the system is weakly trivial. However as any driver on a rush-hour
freeway with no carpool lanes or metering lights can attest, every driver’s pursuing their
own goal definitely does not result in acceptable throughput for the system as a whole;
modifications to private utility functions (like fines for violating carpool lanes or metering
lights) would result in far better global behavior. A system’s being weakly trivial provides
no assurances regarding world utility.
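The traffic argument can be made concrete with a deliberately tiny game. This is only an illustrative sketch: the two-driver setting, the “polite”/“aggressive” actions and the travel times are hypothetical numbers chosen to have a prisoner’s-dilemma structure, not data from the text. World utility is the sum of the private utilities, as in the example above.

```python
from itertools import product

ACTIONS = ("polite", "aggressive")

# Hypothetical travel times (lower is better), keyed by (my action, other driver's action).
TRAVEL_TIME = {
    ("polite", "polite"): 2,
    ("polite", "aggressive"): 4,
    ("aggressive", "polite"): 1,
    ("aggressive", "aggressive"): 3,
}


def private_utility(driver, joint):
    me, other = joint[driver], joint[1 - driver]
    return -TRAVEL_TIME[(me, other)]


def world_utility(joint):
    return sum(private_utility(d, joint) for d in (0, 1))


joint_actions = list(product(ACTIONS, repeat=2))


def is_weakly_trivial():
    """If no driver is worse off under a than under b, G(a) must be at least G(b)."""
    for a, b in product(joint_actions, repeat=2):
        nobody_worse = all(private_utility(d, a) >= private_utility(d, b) for d in (0, 1))
        if nobody_worse and world_utility(a) < world_utility(b):
            return False
    return True


def is_nash(joint):
    """No driver can raise its own private utility by changing only its own action."""
    for d in (0, 1):
        for alt in ACTIONS:
            deviated = list(joint)
            deviated[d] = alt
            if private_utility(d, tuple(deviated)) > private_utility(d, joint):
                return False
    return True


nash_points = [j for j in joint_actions if is_nash(j)]
best_point = max(joint_actions, key=world_utility)

print(is_weakly_trivial())                       # True
print(nash_points, world_utility(nash_points[0]))
print(best_point, world_utility(best_point))
```

In this toy game the system is weakly trivial, yet greedy drivers settle on the joint action with the lowest world utility, which is the failure mode described above.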

The problem with weak triviality is precisely the fact that the individual microlearners
are greedy. In a COIN, there is no system-wide incentive to replace $\zeta$ with a different
worldline that would improve everybody’s private utility, as in the definition of weak
triviality. Rather the incentives apply to each microlearner individually and motivate
the learners to behave in a way that may well hurt some of them. So weak triviality is,
upon examination, a poor choice for the salient characteristic of a COIN.

One alternative to weak triviality follows from considering that we must ‘expect’
a salient characteristic to be coupled to large world utility, per the definition of the
descriptive framework. What can we reasonably assume about a running COIN? We
cannot assume that all the private utilities will have large values; witness the traffic
example. But we can assume that if the microlearners are well-designed, each of them
will be doing close to as well as it can given the behavior of the other nodes. In other words,
within broad limits we can assume that the system is more likely to be in $\zeta$ than $\zeta'$ if for
all $\eta$, $\epsilon_{\eta,g_\eta}(\zeta) \ge \epsilon_{\eta,g_\eta}(\zeta')$. We define a system to be coordinated iff for
any such $\zeta$ and $\zeta'$ lying on C, $G(\zeta) \ge G(\zeta')$. (Again, an obvious variant is to restrict
$\zeta'_{,t<0} = \zeta_{,t<0}$, and require only that both $\zeta_{,t \ge 0}$ and $\zeta'_{,t \ge 0}$ lie in
$C_{,t \ge 0}$.) Traffic systems are not coordinated, in general. This is evident from the simple
fact that if all drivers acted as though there were metering lights when in fact there weren’t
any, they would each be behaving with lower intelligence given the actions of the other
drivers (each driver would benefit greatly by changing its behavior by no longer pretending
there were metering lights, etc.). But nonetheless, world utility would be higher.

4.3.2 The Salient Characteristic of Factoredness 

Like weak triviality, coordination is intimately related to the economics concept of Pareto
optimality. Unfortunately, there is not room in this chapter to present the mathematics
associated with coordination and its variants. However there is room to discuss a third
candidate salient characteristic of COINs, one which like coordination (and unlike weak
triviality) we can reasonably expect to be associated with large world utility. This
alternative fixes weak triviality not by replacing the personal utilities $\{g_\eta\}$ with the
intelligences $\{\epsilon_{\eta,g_\eta}\}$ as coordination does, but rather by only considering worldlines
whose difference at time 0 involves a single node. This results in its being related to Nash
equilibria rather than Pareto optimality.

Say that our COIN’s worldline is $\zeta$. Let $\zeta'$ be any other worldline where
$\zeta'_{,t<0} = \zeta_{,t<0}$, and where $\zeta'_{,t \ge 0} \in C_{,t \ge 0}$. Now restrict attention to those
$\zeta'$ where at t = 0 $\zeta$ and $\zeta'$ differ only for node $\eta$. If for all such $\zeta'$

$$\mathrm{sgn}\big[g_\eta(\zeta) - g_\eta(\zeta_{,t<0}, C(\zeta'_{,0}))\big] = \mathrm{sgn}\big[G(\zeta) - G(\zeta_{,t<0}, C(\zeta'_{,0}))\big], \qquad (4)$$

and if this is true for all nodes $\eta$, then we say that the COIN is factored for all those
utilities $\{g_\eta\}$ (at $\zeta$, with respect to time 0).
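The sign condition in Eq. (4) can be tested by brute force on a small discrete system. The sketch below is a toy check under assumptions not taken from the text: three nodes with t = 0 states in {0, 1, 2}, utilities written directly as functions of the t = 0 joint state, and a hypothetical world utility that rewards a total of exactly 3. The helper name `is_factored` is likewise hypothetical.

```python
def sgn(x):
    return (x > 0) - (x < 0)


def is_factored(personal, world, state, domains):
    """Check Eq. (4) at `state`: every single-node change at t=0 must move that node's
    personal utility and the world utility in the same direction (or leave both fixed)."""
    for node, alternatives in domains.items():
        for alt in alternatives:
            modified = dict(state)
            modified[node] = alt
            d_personal = personal[node](state) - personal[node](modified)
            d_world = world(state) - world(modified)
            if sgn(d_personal) != sgn(d_world):
                return False
    return True


def world(state):
    # Hypothetical world utility: the total activity should equal 3.
    return -(sum(state.values()) - 3) ** 2


domains = {name: [0, 1, 2] for name in ("a", "b", "c")}
state = {"a": 1, "b": 1, "c": 1}

team = {name: world for name in domains}                              # g_eta = G
selfish = {name: (lambda s, name=name: s[name]) for name in domains}  # g_eta = own state

print(is_factored(team, world, state, domains))      # True
print(is_factored(selfish, world, state, domains))   # False
```

Setting every personal utility equal to G is trivially factored; the “selfish” utilities fail because a node that raises its own state from 1 to 2 improves its personal utility while lowering G.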

For a factored system, for any node $\eta$, given the rest of the system, if the node’s state
at t = 0 changes in a way that improves that node’s utility over the rest of time, then
it necessarily also improves world utility. Colloquially, for a system that is factored for
a particular microlearner’s private utility, if that learner does something that improves
that personal utility, then everything else being equal, it has also done something that
improves world utility. Of two potential microlearners for controlling node $\eta$ (i.e., two
potential $\zeta_\eta$) whose behavior until t = 0 is identical but which disagree there, the
microlearner that is smarter with respect to $g_\eta$ will always result in a larger $g_\eta$, by
definition of intelligence. Accordingly, for a factored system, the smarter microlearner is
also the one that results in better G. So as long as we have deployed a sufficiently smart
microlearner on $\eta$, we have assured a good G (given the rest of the system). Formally, this
is expressed in the fact [250] that for a factored system, for all nodes $\eta$,

$$\epsilon_{\eta,g_\eta}(\zeta) = \epsilon_{\eta,G}(\zeta). \qquad (5)$$

One can also prove that Nash equilibria of a factored system are local maxima of world
utility. Note that in keeping with our behaviorist perspective, nothing in the definition
of factored requires private utilities. Indeed, it may well be that a system having private
utilities $\{U_\eta\}$ is factored, but for personal utilities $\{g_\eta\}$ that differ from the $\{U_\eta\}$.

A system’s being factored does not mean that a change to $\zeta_{\eta,0}$ that improves $g_\eta(\zeta)$
cannot also hurt $g_{\eta'}(\zeta)$ for some $\eta' \ne \eta$. Intuitively, for a factored system, the side
effects on the rest of the system of $\eta$’s increasing its own utility do not end up decreasing
world utility, but can have arbitrarily adverse effects on other private utilities. For
factored systems, the separate microlearners successfully pursuing their separate goals
do not frustrate each other as far as world utility is concerned.

In general, we can’t have both perfect learnability and perfect factoredness. As an
example, say that $\forall t$, $Z_{\eta,t} = Z_{\hat{\eta},t} = \mathbb{R}$. Then if
$G(C(\zeta_{,0})) = \zeta_{\eta,0} \times \zeta_{\hat{\eta},0}$ and the system is perfectly learnable, it is not
perfectly factored. This is because $\zeta_{\eta,0} = G(\zeta)/\zeta_{\hat{\eta},0}$ for this case, and
therefore perfect learnability requires that $\forall \zeta \in C$,
$g_\eta(\zeta) = \psi_\eta(G(\zeta)/\zeta_{\hat{\eta},0})$ for some function $\psi_\eta(.)$. However the partial
derivative of this with respect to G will be negative for negative $\zeta_{\hat{\eta},0}$, which means
the system is actually “anti-factored” for such $\zeta$. Due to such incompatibility between
perfect factoredness and perfect learnability, we must usually be content with having high
degree of factoredness and high learnability. In such situations, the emphasis of the
macrolearning process should be more and more on having high degree of factoredness as
we get closer and closer to a Nash equilibrium. This way the system won’t relax to an
incorrect local maximum.
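The sign flip in this example is easy to verify numerically. The sketch below assumes, for illustration only, that $\psi_\eta$ is the identity, so that the personal utility is just the node’s own t = 0 state (perfectly learnable), while world utility is the product of the two states as in the example. The function names and the numerical values are hypothetical.

```python
def sgn(x):
    return (x > 0) - (x < 0)


def world_utility(own, others):
    return own * others             # G = zeta_{eta,0} * zeta_{hat-eta,0}


def personal_utility(own, others):
    return own                      # psi = identity: depends on the node's own state only


def direction_agreement(own, new_own, others):
    """+1 if changing the node's state moves g and G the same way, -1 if opposite ways."""
    d_personal = personal_utility(new_own, others) - personal_utility(own, others)
    d_world = world_utility(new_own, others) - world_utility(own, others)
    return sgn(d_personal) * sgn(d_world)


print(direction_agreement(1.0, 2.0, others=3.0))    # 1: factored for positive others
print(direction_agreement(1.0, 2.0, others=-3.0))   # -1: anti-factored for negative others
```

With a positive state for the rest of the system, raising the node’s own state raises both utilities; with a negative one, the same change raises the personal utility and lowers G.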

In practice of course, a COIN will often not be perfectly factored. Nor in practice 
are we always interested only in whether the system is factored at one particular point 
(rather than across a region say). These issues are discussed in [250], where in particular 
a formal definition of the degree of factoredness of a system is presented.

If a system is factored for utilities $\{g_\eta\}$, then it is also factored for any utilities $\{g'_\eta\}$
where for each $\eta$, $g'_\eta$ is a monotonically increasing function of $g_\eta$. More generally, the
following result characterizes the set of all factored personal utilities:

Theorem 1: A system is factored at all $\zeta \in C$ iff for all those $\zeta$, $\forall \eta$, we can write

$$g_\eta(\zeta) = \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G(\zeta)) \qquad (6)$$

for some function $\Phi_\eta(.,.,.)$ such that $\partial_G \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G) > 0$ for all $\zeta \in C$ and associated
$G$ values. (The form of the $\{g_\eta\}$ off of $C$ is arbitrary.)


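Proof: For fixed $\zeta_{\hat\eta,0}$ and $\zeta_{,t<0}$, any change to $\zeta_{\eta,0}$ which keeps $\zeta_{,t \ge 0}$ on $C$ and which at
the same time increases $G(\zeta) = G(\zeta_{,t<0}, C(\zeta_{\eta,0}, \zeta_{\hat\eta,0}))$ must increase $\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G(\zeta))$,
due to the restriction on $\partial_G \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G)$. This establishes the backwards direction
of the proof.

For the forward direction, write $g_\eta(\zeta) = g_\eta(\zeta, G(\zeta)) = g_\eta(\zeta_{,t<0}, C(\zeta_{\eta,0}, \zeta_{\hat\eta,0}), G(\zeta))$ $\forall \zeta \in C$.
Define this formulation of $g_\eta$ as $\Phi_\eta(\zeta_{,t<0}, \zeta_{,0}, G(\zeta))$, which we can re-express as
$\Phi_\eta(\zeta_{,t<0}, \zeta_{\eta,0}, \zeta_{\hat\eta,0}, G(\zeta))$. Now since the system is factored, $\forall \zeta \in C$, $\forall \zeta'_{,t \ge 0} \in C_{,t \ge 0}$,

$$\Phi_\eta(\zeta_{,t<0}, \zeta_{\eta,0}, \zeta_{\hat\eta,0}, G(\zeta_{,t<0}, C(\zeta_{\eta,0}, \zeta_{\hat\eta,0}))) = \Phi_\eta(\zeta_{,t<0}, \zeta'_{\eta,0}, \zeta_{\hat\eta,0}, G(\zeta_{,t<0}, C(\zeta'_{\eta,0}, \zeta_{\hat\eta,0})))$$

whenever

$$G(\zeta_{,t<0}, C(\zeta'_{\eta,0}, \zeta_{\hat\eta,0})) = G(\zeta_{,t<0}, C(\zeta_{\eta,0}, \zeta_{\hat\eta,0})).$$

So consider any situation where the system is factored, and both the values of $G$ and
of $\zeta_{\hat\eta,0}$ are specified. Then we can find any $\zeta_{\eta,0}$ consistent with those values (i.e., such
that our provided value of $G$ equals $G(\zeta_{,t<0}, C(\zeta_{\eta,0}, \zeta_{\hat\eta,0}))$), evaluate the resulting value
of $\Phi_\eta(\zeta_{,t<0}, \zeta_{\eta,0}, \zeta_{\hat\eta,0}, G)$, and know that we would have gotten the same value if we had
found a different consistent $\zeta_{\eta,0}$. This is true for all $\zeta \in C$. Therefore the mapping
$(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G) \to g_\eta$ is single-valued, and we can write $g_\eta(\zeta) = \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G(\zeta))$. QED.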


By Thm. 1, we can ensure that the system is factored without any concern for
$C$, by having each $g_\eta(\zeta) = \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G(\zeta))$ $\forall \zeta \in Z$. Alternatively, by only requiring
that $\forall \zeta \in C$ does $g_\eta(\zeta) = \Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G(\zeta))$ (i.e., does $g_\eta(\zeta_{,t<0}, C(\zeta_{,0})) =
\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G(\zeta_{,t<0}, C(\zeta_{,0})))$), we can access a broader class of factored utilities, a
class that does depend on $C$. Loosely speaking, for those utilities, we only need the
projection of $\partial_{\zeta_{\eta,0}} G(\zeta)$ onto $C_{\eta,0}$ to be parallel to the projection of $\partial_{\zeta_{\eta,0}} g_\eta(\zeta)$ onto
$C_{\eta,0}$. Given $G$ and $C$, there are infinitely many $\partial_{\zeta_{\eta,0}} g_\eta(\zeta)$ having this projection (the
set of such $\partial_{\zeta_{\eta,0}} g_\eta(\zeta)$ form a linear subspace of $Z$). The partial differential equations
expressing the precise relationship are discussed in [250].

As an example of the foregoing, consider a “team game” (also known as an “exact
potential game” [76, 169]) in which $g_\eta = G$ for all $\eta$. Such COINs are factored, trivially,
regardless of $C$; if $g_\eta$ rises, then $G$ must as well, by definition. (Alternatively, to confirm
that team games are factored just take $\Phi_\eta(\zeta_{,t<0}, \zeta_{\hat\eta,0}, G) = G$ $\forall \eta$ in Thm. 1.) On the
other hand, as discussed below, COINs with ‘wonderful life’ personal utilities are also
factored, but the definition of such utilities depends on $C$.

4.3.3 Wonderful life utility 

Due to their often having poor learnability and requiring centralized communication
(among other infelicities), in practice team game utilities often are poor choices for
personal utilities. Accordingly, it is often preferable to use some other set of factored
utilities. To present an important example, first define the ($t = 0$) effect set of node $\eta$
at $\zeta$, $C^{\mathrm{eff}}_\eta(\zeta)$, as the set of all components $\zeta_{\eta',t}$ for which $\partial_{\zeta_{\eta,0}} (C(\zeta_{,0}))_{\eta',t} \neq 0$. Define the
effect set $C^{\mathrm{eff}}_\eta$ with no specification of $\zeta$ as $\cup_{\zeta \in C}\, C^{\mathrm{eff}}_\eta(\zeta)$. (We take this latter definition
to be the default meaning of “effect set”.) Intuitively, $\eta$’s effect set is the set of all
components $\zeta_{\eta',t}$ which would be affected by a change in the state of node $\eta$ at time 0.
(They may or may not be affected by changes in the $t = 0$ states of the other nodes.)
The extension for times other than 0 is immediate, as is the extension to effect sets that
consist of subsets of the set of all components $\zeta_{\eta',t;i}$ rather than of the set of vectors $\zeta_{\eta',t}$.
These extensions will be skipped here though to minimize the number of variables we
must keep track of.

Next take the Wonderful Life set $\sigma$ to be a set of components $(\eta', t)$ and define
$\mathrm{CL}_\sigma(\zeta)$ as the vector $\zeta$ modified by clamping the $\sigma$-components of $\zeta$ to an arbitrary fixed
value, here taken to be 0 for all such components. Then the value of the wonderful
life utility (WLU for short) for $\sigma$ at $\zeta$ is:

$$\mathrm{WLU}_\sigma(\zeta) = G(\zeta) - G(\mathrm{CL}_\sigma(\zeta)). \qquad (7)$$

In particular, the WLU for the effect set of node $\eta$ is $G(\zeta) - G(\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta))$, which for
$\zeta \in C$ can be written as $G(\zeta_{,t<0}, C(\zeta_{,0})) - G(\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0}, C(\zeta_{,0})))$.
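
The definition of the WLU in Eq. (7) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a worldline is stored as a dictionary from hypothetical (node, time) labels to component vectors, that the world utility is any user-supplied function of that dictionary, and that clamping means setting the chosen components to 0 as in the text. The names `clamp`, `wlu`, and `toy_G` are illustrative only.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Key = Tuple[str, int]                 # (node, time) label of one component vector
Worldline = Dict[Key, List[float]]    # assumed representation of the worldline zeta


def clamp(zeta: Worldline, sigma: Iterable[Key]) -> Worldline:
    """Return CL_sigma(zeta): a copy of zeta with the sigma-components set to 0."""
    clamped_keys = set(sigma)
    return {
        key: [0.0] * len(vec) if key in clamped_keys else list(vec)
        for key, vec in zeta.items()
    }


def wlu(G: Callable[[Worldline], float], zeta: Worldline, sigma: Iterable[Key]) -> float:
    """Wonderful life utility of Eq. (7): G(zeta) - G(CL_sigma(zeta))."""
    return G(zeta) - G(clamp(zeta, sigma))


def toy_G(zeta: Worldline) -> float:
    """Toy world utility (an assumption for illustration): squared sum of all components."""
    total = sum(sum(vec) for vec in zeta.values())
    return total ** 2


# Worldline with two nodes; node "a" is present at times 0 and 1.
zeta = {("a", 0): [1.0], ("b", 0): [2.0], ("a", 1): [3.0]}
value = wlu(toy_G, zeta, sigma=[("a", 0), ("a", 1)])
```

Note that `wlu` only needs the full worldline and `G`; it never simulates how the system would have evolved from the clamped state.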

We can view $\eta$’s effect set WLU as analogous to the change in world utility that would
have arisen if node $\eta$ “had never existed”. (Hence the name of this utility; cf. the Frank
Capra movie.) Note however, that CL is a purely “fictional”, counter-factual operation,
in the sense that it produces a new $\zeta$ without taking into account the system’s dynamics.
Indeed, no assumption is even being made that $\mathrm{CL}_\sigma(\zeta)$ is consistent with the dynamics
of the system. The sequence of states the node $\eta$ is clamped to in the definition of the
WLU need not be consistent with the dynamical laws of the system.

This dynamics-independence is a crucial strength of the WLU. It means that to 
evaluate the WLU we do not try to infer how the system would have evolved if node $\eta$’s
state were set to 0 at time 0 and the system evolved from there. So long as we know $\zeta$
extending over all time, and so long as we know G, we know the value of WLU. This is 
true even if we know nothing of the dynamics of the system. 

An important example is effect set wonderful life utilities when the set of all nodes
is partitioned into ‘subworlds’ in such a way that all nodes in the same subworld $\omega$ share
substantially the same effect set. In such a situation, all nodes in the same subworld $\omega$
will have essentially the same personal utilities, exactly as they would if they used team
game utilities with a “world” given by $\omega$. When all such nodes have large intelligence
values, this sharing of the personal utility will mean that all nodes in the same subworld 
are acting in a coordinated fashion, loosely speaking. 

The importance of the WLU arises from the following results: 


Theorem 2: A COIN is factored for personal utilities set equal to the associated effect 
set wonderful life utilities. 


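Proof: Since $\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta)$ is independent of $\zeta_{\eta',t}$ for all $(\eta', t) \in C^{\mathrm{eff}}_\eta$, so is the $Z$ vector
$\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0}, C(\zeta_{,0}))$. I.e., $\partial_{\zeta_{\eta,0}} [\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0}, C(\zeta_{,0}))]_{\eta',t} = 0$ $\forall \eta', t$. This means that
viewed as a function from $C_{,0}$ to $Z$, $\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0}, C(.))$ is a single-valued function of $\zeta_{\hat\eta,0}$.
Therefore $G(\mathrm{CL}_{C^{\mathrm{eff}}_\eta}(\zeta_{,t<0}, C(\zeta_{,0})))$ can only depend on $\zeta_{,t<0}$ and the non-$\eta$ components
of $\zeta_{,0}$. Accordingly, the WLU for $C^{\mathrm{eff}}_\eta$ is just $G$ minus a term that is a function of $\zeta_{,t<0}$
and $\zeta_{\hat\eta,0}$. By choosing $\Phi_\eta(.,.,.)$ in Thm. 1 to be that difference, we see that $\eta$’s effect
set WLU is of the form necessary for the system to be factored. QED.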


More generally, the system is factored if each node $\eta$’s personal utility is (a monotonically
increasing function of) the WLU for a set that contains $C^{\mathrm{eff}}_\eta$.

For conciseness, except where explicitly needed, we will suppress the argument “$\zeta_{,t<0}$”
in the rest of this subsection, taking it to be implicit. To understand the potential
practical advantages of the WLU, we start with the following:

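Theorem 3: Let $\sigma$ be a set containing $C^{\mathrm{eff}}_\eta$. Then

$$\frac{\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\|\partial_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))\|}{\|\partial_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0})) - \partial_{\zeta_{\hat\eta,0}} G(\mathrm{CL}_\sigma(C(\zeta_{,0})))\|}.$$

Proof: Writing it out,

$$\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta) = \frac{\|\partial_{\zeta_{\eta,0}} G(C(\zeta_{,0})) - \partial_{\zeta_{\eta,0}} G(\mathrm{CL}_\sigma(C(\zeta_{,0})))\|}{\|\partial_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0})) - \partial_{\zeta_{\hat\eta,0}} G(\mathrm{CL}_\sigma(C(\zeta_{,0})))\|}.$$

The second term in the numerator equals 0, by definition of effect set. Dividing by the
similar expression for $\lambda_{\eta,G}(\zeta)$ then gives the result claimed. QED.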
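
The identity in Thm. 3 can be checked numerically on a toy system. The sketch below is only an illustration under stated assumptions: the system exists for a single instant with three scalar node states, node 0 plays the role of $\eta$ with effect set equal to its own state, the toy utility `G` is invented for the example, and differential learnability is taken to be the ratio of gradient norms that appears in the proof above.

```python
import math
from typing import Callable, List, Sequence

ETA = 0  # index of the node whose learnability we examine (illustrative choice)


def grad(f: Callable[[List[float]], float], z: Sequence[float],
         idx: Sequence[int], h: float = 1e-6) -> List[float]:
    """Central finite-difference partials of f at z with respect to components idx."""
    out = []
    for i in idx:
        zp, zm = list(z), list(z)
        zp[i] += h
        zm[i] -= h
        out.append((f(zp) - f(zm)) / (2 * h))
    return out


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def learnability(U: Callable[[List[float]], float], z: Sequence[float], eta: int) -> float:
    """Assumed definition: |dU/dz_eta| divided by the norm of dU/dz over all other nodes."""
    others = [i for i in range(len(z)) if i != eta]
    return norm(grad(U, z, [eta])) / norm(grad(U, z, others))


def G(z: List[float]) -> float:
    """Toy world utility (hypothetical, for illustration only)."""
    return math.sin(z[0]) * z[1] + z[1] * z[2] + z[0] * z[2] ** 2


def G_clamped(z: List[float]) -> float:
    """G evaluated after clamping node ETA's state to 0."""
    zc = list(z)
    zc[ETA] = 0.0
    return G(zc)


def WLU(z: List[float]) -> float:
    return G(z) - G_clamped(z)


z0 = [0.7, 1.3, -0.4]
others = [1, 2]
# Left side of Thm. 3: ratio of the two learnabilities.
lhs = learnability(WLU, z0, ETA) / learnability(G, z0, ETA)
# Right side: gradient norms taken over the other nodes' states.
g_full = grad(G, z0, others)
g_clamp = grad(G_clamped, z0, others)
rhs = norm(g_full) / norm([a - b for a, b in zip(g_full, g_clamp)])
```

The two sides agree because the clamped term does not depend on node 0's state, which is the step the proof attributes to the definition of effect set.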

So if we expect that ratio of magnitudes of gradients to be large, effect set WLU has
much higher learnability than team game utility, while still being factored, like team
game utility. As an example, consider the case where the COIN is a very large system,
with $\eta$ being only a relatively minor part of the system (e.g., a large human economy
with $\eta$ being a “typical John Doe”). Often in such a system, for the vast majority of
nodes $\eta' \neq \eta$, how $G$ varies with $\zeta_{\eta',\cdot}$ will be essentially independent of the value $\zeta_{\eta,0}$.
(E.g., how GDP of the US economy varies with the actions of John Doe in Peoria, Illinois
will be independent of the state of some Jane Smith in Los Angeles, California.) In such
circumstances, Thm. 3 tells us that the effect set wonderful life utility for $\eta$ will have a
far larger learnability than does the world utility.

For any fixed $\sigma$, if we change the clamping operation (i.e., change the choice of the
“arbitrary fixed value” we clamp each component to), then we change the mapping
$\zeta_{,0} \to \mathrm{CL}_\sigma(C(\zeta_{,0}))$, and therefore change the mapping $(\zeta_{\eta,0}, \zeta_{\hat\eta,0}) \to G(\mathrm{CL}_\sigma(C(\zeta_{,0})))$.
Accordingly, changing the clamping operation can affect the value of $\partial_{\zeta_{\hat\eta,0}} G(\mathrm{CL}_\sigma(C(\zeta_{,0})))$
evaluated at some point $\zeta_{,0}$. Therefore, by Thm. 3, changing the clamping operation
can affect $\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta)$. So properly speaking, for any choice of $\sigma$, if we’re going to
use $\mathrm{WLU}_\sigma$, we should set the clamping operation so as to maximize learnability. For
simplicity though, in this paper we will ignore this phenomenon, and simply set the
clamping operation to a more or less “natural” choice.

Next consider the case where, for some node $\eta$, we can write $G(\zeta)$ as
$G_1(\zeta_{C^{\mathrm{eff}}_\eta}) + G_2(\zeta_{\hat{C}^{\mathrm{eff}}_\eta})$. Say it is also true that $\eta$’s effect set is a small fraction of the set of
all components. In this case it is often true that the values of $G(.)$ are much larger than
those of $G_1(.)$, which means that partial derivatives of $G(.)$ are much larger than those of
$G_1(.)$. In such situations the effect set WLU is far more learnable than the world utility,
due to the following results:

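Theorem 4: If for some node $\eta$ there is a set $\sigma$ containing $C^{\mathrm{eff}}_\eta$, a function $G_1(\zeta_\sigma \in Z_\sigma)$,
and a function $G_2(\zeta_{\hat\sigma} \in Z_{\hat\sigma})$, such that $G(\zeta) = G_1(\zeta_\sigma) + G_2(\zeta_{\hat\sigma})$, then

$$\frac{\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\|\partial_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))\|}{\|\partial_{\zeta_{\hat\eta,0}} G(\mathrm{CL}_{\hat\sigma}(C(\zeta_{,0})))\|}.$$

Proof: For brevity, write $G_1$ and $G_2$ both as functions of full $\zeta \in Z$, just such functions
that are only allowed to depend on the components of $\zeta$ that lie in $\sigma$ and those
components that do not lie in $\sigma$, respectively. Then the $\sigma$ WLU for node $\eta$ is just
$g_\eta(\zeta) = G_1(\zeta) - G_1(\mathrm{CL}_\sigma(\zeta))$. Since in that second term we’re clamping all the components
of $\zeta$ that $G_1(.)$ cares about, for this personal utility $\partial_{\zeta_{,0}} g_\eta(C(\zeta_{,0})) = \partial_{\zeta_{,0}} G_1(C(\zeta_{,0}))$.
So in particular $\partial_{\zeta_{\hat\eta,0}} g_\eta(C(\zeta_{,0})) = \partial_{\zeta_{\hat\eta,0}} G_1(C(\zeta_{,0})) = \partial_{\zeta_{\hat\eta,0}} G(\mathrm{CL}_{\hat\sigma}(C(\zeta_{,0})))$. Now by
definition of effect set, $\partial_{\zeta_{\eta,0}} G_2(\zeta_{,t<0}, C(\zeta_{,0})) = 0$, since $\hat\sigma$ does not contain $C^{\mathrm{eff}}_\eta$. So
$\partial_{\zeta_{\eta,0}} G(C(\zeta_{,0})) = \partial_{\zeta_{\eta,0}} G_1(C(\zeta_{,0})) = \partial_{\zeta_{\eta,0}} g_\eta(C(\zeta_{,0}))$. QED.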

The obvious extensions of Thms. 3 and 4 for when we’re considering effect sets with
respect to times other than 0 hold.

An important special case of Thm. 4 is the following:

Corollary 1: If for some node $\eta$ we can write

i) $G(\zeta) = G_1(\zeta_\sigma) + G_2([\zeta_{\hat\sigma}]_{t \ge 0}) + G_3(\zeta_{,t<0})$

for some set $\sigma$ containing $C^{\mathrm{eff}}_\eta$, and if

ii) $\|\partial_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))\| \gg \|\partial_{\zeta_{\hat\eta,0}} G_1([C(\zeta_{,0})]_\sigma)\|$,

then

$$\lambda_{\eta,\mathrm{WLU}_\sigma}(\zeta) \gg \lambda_{\eta,G}(\zeta).$$

In practice, to assure that condition (i) of this corollary is met might require that $\sigma$
be a proper superset of $C^{\mathrm{eff}}_\eta$. Countervailingly, to assure that condition (ii) is met will
usually force us to keep $\sigma$ as small as possible.

4.3.4 Inducing our salient characteristic 

Usually in a descriptive framework our mathematics (a formal investigation of the
salient characteristics) will not provide theorems of the sort, “If you modify the COIN
the following way at time $t$, the value of the world utility will increase.” Rather it
provides theorems that relate a COIN’s salient characteristics with the general properties
of the COIN’s entire history, and in particular with those properties embodied in $C$.
In particular, the salient characteristic that we are concerned with in this chapter is 
that the system be highly intelligent for personal utilities for which it is factored, and 
our mathematics concerns the relationship between factoredness, intelligence, personal 
utilities, effect sets, and the like. 

More formally, the desideratum associated with our salient characteristic is that we
want the COIN to be at a $\zeta$ for which there is some set of $\{g_\eta\}$ (not necessarily consisting
of private utilities) such that (a) $\zeta$ is factored for the $\{g_\eta\}$, and (b) $\epsilon_{\eta,g_\eta}(\zeta)$ is large for
all $\eta$. Now there are several ways one might try to induce the COIN to be at such a
point. One approach is to have each algorithm controlling $\eta$ explicitly try to “steer” the
worldline towards such a point. In this approach $\eta$ needn’t even have a private utility in
the usual sense. (The overt “goal” of the algorithm controlling $\eta$ involves finding a $\zeta$ with
a good associated extremum over the class of all possible $g_\eta$, independent of any private
utilities.) Now initialization of the COIN, i.e., fixing of $\zeta_{,0}$, involves setting the algorithm
controlling $\eta$, in this case to the steering algorithm. Accordingly, in this approach in
initialization we fix $\zeta_{,0}$ to a point for which there is some special $g_\eta$ such that both $\zeta$ is
factored for $g_\eta$, and $\epsilon_{\eta,g_\eta}(\zeta)$ is large. There is nothing peculiar about this. What is odd
though is that in this approach we do not know what that “special” $g_\eta$ is when we do
that initialization; it’s to be determined, by the unfolding of the system.

In this chapter we concentrate on a different approach, which can involve either
initialization or macrolearning. In this alternative we deploy the $\{g_\eta\}$ as the microlearners’
private utilities at some $t < 0$, in a process not captured in C, so as to induce a factored
COIN that is as intelligent as possible. (It is with that “deploying of the $\{g_\eta\}$” that we
are trying to induce our salient characteristic in the COIN.) Since in this approach we
are using private utilities, we can replace intelligence with its surrogate, learnability. So
our task is to choose $\{g_\eta\}$ which are as learnable as possible while still being factored.

Solving for such utilities can be expressed as solving a set of coupled partial differential
equations. Those equations involve the tangent plane to the manifold $C$, a functional
trading off (the differential versions of) degree of factoredness and learnability, and any
communication constraints on the nodes we must respect. While there is not space in the
current chapter to present those equations, we can note that they are highly dependent on
the correlations among the components of $\zeta$. So in this approach, in COIN initialization
we use some preliminary guesses as to those correlations to set the initial $\{g_\eta\}$. For
example, the effect set of a node $\eta$ constitutes all components $\zeta_{\eta',t \ge 0}$ that have non-zero
correlation with $\zeta_{\eta,0}$. Furthermore, by Thm. 2 the system is factored for effect set WLU
personal utilities. And by Coroll. 1, for small effect sets, the effect set WLU has much
greater differential utility learnability than does $G$. Extending the reasoning behind this
result to all $\zeta$ (or at least all likely $\zeta$), we see that for this scenario, the descriptive
framework advises us to use Wonderful Life private utilities based on (guesses for) the
associated effect sets rather than the team game private utilities, $g_\eta = G$ $\forall \eta$.

In macrolearning we must instead run-time estimate an approximate solution to our
partial differential equations, based on statistical inference.$^{12}$ As an example, we might
start with an initial guess as to $\eta$’s effect set, and set its private utility to the associated
WLU. But then as we watch the system run and observe the correlations among the
components of $\zeta$, we might modify which components we think comprise $\eta$’s effect set,
and modify $\eta$’s personal utility accordingly.

12 Recall that in the physical world, it is often useful to employ devices using algorithms that are 
based on probabilistic concepts, even though the underlying system is ultimately deterministic. (Indeed, 
theological Bayesians invoke a “degree of belief” interpretation of probability to demand such an approach;
see [?] for a discussion of the legitimacy of this viewpoint.) Similarly, although we take the underlying
system in a COIN to be deterministic, it is often useful to use microlearners or (as here) macrolearners
that are based on probabilistic concepts.

4.4 Illustrative Simulations of our Descriptive Framework


As implied above, often one can perform reasonable COIN initialization and/or macrolearn- 
ing without writing down the partial differential equations governing our salient char- 
acteristic explicitly. Simply “hacking” one’s way to the goal of maximizing both degree 
of factoredness and intelligibility, for example by estimating effect sets, often results in 
dramatic improvement in performance. This is illustrated in the experiments recounted 
in the next two subsections. 

4.4.1 COIN Initialization 

Even if we don’t exactly know the effect set of each node $\eta$, often we will be able to make
a reasonable guess about which components of $\zeta$ comprise the “preponderance” of $\eta$’s
effect set. We call such a set a guessed effect set. As an example, often the primary
effects of changes to $\eta$’s state will be on the future state of $\eta$, with only relatively minor
effects on the future states of other nodes. In such situations, we would expect to still
get good results if we approximated the effect set WLU of each node $\eta$ with a WLU
based on the guessed effect set $\zeta_{\eta,t \ge 0}$. In other words, we would expect to be able to
replace $\mathrm{WLU}_{C^{\mathrm{eff}}_\eta}$ with $\mathrm{WLU}_{\zeta_{\eta,t \ge 0}}$ and still get good performance.

This phenomenon was borne out in the experiments recounted in [255] that used 
COIN initialization for distributed control of network packet routing. In a conventional 
approach to packet routing, each router runs what it believes (based on the information 
available to it) to be a shortest path algorithm (SPA), i.e., each router sends its packets in 
the way that it surmises will get those packets to their destinations most quickly. Unlike 
with a COIN, with SPA-based routing the routers have no concern for the possible 
deleterious side-effects of their routing decisions on the global performance (e.g., they 
have no concern for whether they induce bottlenecks). We ran simulations in which 
we compared a COIN-based routing system to an SPA-based system. For the COIN- 
based system G was global throughput and no macrolearning was used. The COIN 
initialization was to have each router’s private utility be a WLU based on an associated 
guessed effect set generated a priori. In addition, the COIN-based system was realistic 
in that each router’s reinforcement algorithm had imperfect knowledge of the state of the 
system. On the other hand, the SPA was an idealized “best-possible” system, in which 
each router knew exactly what the shortest paths were at any given time. Despite the 
handicap that this disparity imposed on the COIN, it achieved significantly better global 
throughput in our experiments than did the perfect-knowledge SPA-based system. 

The experiments in [255] were primarily concerned with the application of packet- 
routing. To concentrate more precisely on the issue of COIN initialization, we ran 
subsequent experiments on variants of Arthur’s famous “El Farol bar problem” (see 
Section 3). To facilitate the analysis we modified Arthur’s original problem to be more 
general, and since we were not interested in directly comparing our results to those in
the literature, we used a more conventional (and arguably “dumber”) machine learning
algorithm than the ones investigated in [4, 45, 49, 61]. 

In this formulation of the bar problem [256], there are N agents, each of whom picks 
one of seven nights to attend a bar the following week, a process that is then repeated. 
In each week, each agent's pick is determined by its predictions of the associated rewards 
it would receive. These predictions in turn are based solely upon the rewards received 
by the agent in preceding weeks. An agent’s “pick” at week t (i.e., its node’s state at 
that week) is represented as a unary seven-dimensional vector. (See the discussion in the 
definitions subsection of our representing discrete variables as Euclidean variables.) So 
$\eta$’s zeroing its state in some week, as in the clamping (CL) operation, essentially means it elects
not to attend any night that week. 

The world utility is

$$G(\zeta) = \sum_t R(\zeta_{,t}),$$

where: $R(\zeta_{,t}) = \sum_{k=1}^{7} \gamma_k(x_k(\zeta_{,t}))$; $x_k(\zeta_{,t})$ is total attendance on night $k$ at week
$t$; $\gamma_k(y) = \alpha_k\, y \exp(-y/c)$; and $c$ and each of the $\{\alpha_k\}$ are real-valued parameters.
Intuitively, the “world reward” $R$ is the sum of the global “rewards” for each night in
each week. It reflects the effects in the bar as the attendance profile of agents changes.
When there are too few agents attending some night, the bar suffers from lack of activity
and therefore the global reward for that night is low. Conversely, when there are too
many agents the bar is overcrowded and the reward for that night is again low. Note
that $\gamma_k(\cdot)$ reaches its maximum when its argument equals $c$.
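
As a small illustration of this definition, the following Python sketch evaluates the per-night reward and the weekly world reward, and confirms that the per-night reward peaks at an attendance of $c$. The function names and the choice to scan integer attendances from 0 to 50 are assumptions made for the example.

```python
import math
from typing import Sequence


def gamma(y: float, alpha: float, c: float) -> float:
    """Per-night reward gamma_k(y) = alpha_k * y * exp(-y / c)."""
    return alpha * y * math.exp(-y / c)


def world_reward(attendance: Sequence[float], alphas: Sequence[float], c: float) -> float:
    """World reward R for one week: the sum over nights of gamma_k(x_k)."""
    return sum(gamma(x, a, c) for x, a in zip(attendance, alphas))


C = 6.0
# Scan integer attendances on a single night to locate the maximum of gamma.
best_attendance = max(range(51), key=lambda y: gamma(y, 1.0, C))
# World reward when every one of the seven nights has exactly c attendees.
ideal_week = world_reward([6] * 7, [1.0] * 7, C)
```

Setting the derivative $\alpha_k e^{-y/c}(1 - y/c)$ to zero gives the same maximizer, $y = c$.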

In these experiments we investigate two different $\alpha$’s. One treats all nights equally;
$\alpha$ = [1 1 1 1 1 1 1]. The other is only concerned with one night; $\alpha$ = [0 0 0 7 0 0 0].
In our experiments, $c = 6$ and $N$ is chosen to be 4 times larger than the number of
agents necessary to have $c$ agents attend the bar on each of the seven nights, i.e., there
are $4 \times 6 \times 7 = 168$ agents (this ensures that there are no trivial solutions and that for
the world utility to be maximized, the agents have to “cooperate”).

As explicated below, our microlearning algorithms worked by providing a real-valued
“reward” signal to each agent at each week t. Each agent’s reward function is a surrogate 
for an associated utility function for that agent. The difference between the two functions 
is that the reward function only reflects the state of the system at one moment in time 
(and therefore is potentially observable), whereas the utility function reflects the agent’s 
ultimate goal, and therefore can depend on the full history of that agent across time. 

We investigated three agent reward functions. With $d_\eta$ the night selected by $\eta$, they
are:

Uniform Division (UD): $r_\eta(\zeta_{,t}) = \gamma_{d_\eta}(x_{d_\eta}(\zeta_{,t})) \,/\, x_{d_\eta}(\zeta_{,t})$

Global (G): $r_\eta(\zeta_{,t}) = R(\zeta_{,t}) = \sum_{k=1}^{7} \gamma_k(x_k(\zeta_{,t}))$

Wonderful Life (WL): $r_\eta(\zeta_{,t}) = \gamma_{d_\eta}(x_{d_\eta}(\zeta_{,t})) - \gamma_{d_\eta}(x_{d_\eta}(\mathrm{CL}_\eta(\zeta_{,t})))$


The conventional UD reward is a natural “naive” choice for the agents’ reward; the
total reward on each night gets uniformly divided among the agents attending that night.
If we take $g_\eta(\zeta) = \sum_t r_\eta(\zeta_{,t})$ (i.e., $\eta$’s utility is an undiscounted sum of its rewards), then
for the UD reward $G(\zeta) = \sum_\eta g_\eta(\zeta)$, so that the system is weakly trivial. The original
version of the bar problem in the physics literature [49] is the special case where UD
reward is used but there are only two “nights” in the week (one of which corresponds to
“staying at home”); $\alpha$ is uniform; and $\gamma_k(x_k) = \Theta(N/2 - x_k)$. So the reward to agent
$\eta$ is 1 if it attends the bar and attendance is below capacity, or if it stays at home and
the bar is over capacity. Reward is 0 otherwise. (In addition, unlike in our COIN-based
systems, in the original work on the bar problem the microlearners work by explicitly
predicting the bar attendance, rather than by directly modifying behavior to
try to increase a reward signal.)

In contrast to the UD reward, providing the G reward at time $t$ to each agent results
in all agents receiving the same reward. For this reward function, the system is
automatically factored if we define $g_\eta(\zeta) = \sum_t r_\eta(\zeta_{,t})$. However, evaluation of this reward
function requires centralized communication concerning all seven nights. Furthermore, 
given that there are 168 agents, G is likely to have poor learnability as a reward for any 
individual agent. 

This latter problem is obviated by using the WL reward, where the subtraction of
the clamped term removes some of the “noise” of the activity of all other agents, leaving
only the underlying “signal” of how the agent in question affects the utility. So one
would expect that with the WL reward the agents can readily discern the effects of their
actions on their rewards. Even though the conditions in Coroll. 1 don’t hold$^{13}$, this
reasoning accords with the implicit advice of Coroll. 1 under the approximation of the
$t = 0$ effect set as $C^{\mathrm{eff}}_\eta \approx \zeta_{\eta,t \ge 0}$. I.e., it agrees with that corollary’s implicit advice under
the identification of $\zeta_{\eta,t \ge 0}$ as $\eta$’s $t = 0$ guessed effect set.
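
The three reward functions discussed above can be sketched in a few lines of Python. This is an illustrative reading of the definitions rather than the code used in the experiments: it assumes each agent's pick is stored as a night index from 0 to 6, and that clamping an agent's state to 0 simply removes that agent from the night it attended. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def gamma(y: float, alpha: float, c: float) -> float:
    """Per-night reward gamma_k(y) = alpha_k * y * exp(-y / c)."""
    return alpha * y * math.exp(-y / c)


def attendance(picks: Sequence[int], nights: int = 7) -> List[int]:
    """Total attendance x_k on each night, given every agent's picked night."""
    x = [0] * nights
    for d in picks:
        x[d] += 1
    return x


def ud_reward(eta: int, picks: Sequence[int], alphas: Sequence[float], c: float) -> float:
    """Uniform Division: the night's reward split evenly among its attendees."""
    x = attendance(picks)
    d = picks[eta]
    return gamma(x[d], alphas[d], c) / x[d]


def g_reward(eta: int, picks: Sequence[int], alphas: Sequence[float], c: float) -> float:
    """Global: every agent receives the full world reward R (eta is unused)."""
    x = attendance(picks)
    return sum(gamma(xk, a, c) for xk, a in zip(x, alphas))


def wl_reward(eta: int, picks: Sequence[int], alphas: Sequence[float], c: float) -> float:
    """Wonderful Life: R minus R with eta clamped; only eta's own night changes."""
    x = attendance(picks)
    d = picks[eta]
    return gamma(x[d], alphas[d], c) - gamma(x[d] - 1, alphas[d], c)


ALPHAS = [1.0] * 7
C = 6.0
picks = [0, 0, 1, 3, 3, 3]          # six agents, for a small worked example
ud_total = sum(ud_reward(i, picks, ALPHAS, C) for i in range(len(picks)))
world = g_reward(0, picks, ALPHAS, C)
wl_agent0 = wl_reward(0, picks, ALPHAS, C)
```

In this sketch `ud_total` equals `world`, matching the observation that the UD rewards sum to the world reward, and `wl_reward` needs only the attendance on the agent's own night.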

In fact, in this very simple system, we can explicitly calculate the ratio of the WL
reward’s learnability to that of the G reward, by recasting the system as existing for
only a single instant so that $C^{\mathrm{eff}}_\eta = \zeta_{\eta,0}$ exactly and then applying Thm. 3. So for
example, say that all $\alpha_k = 1$, and that the number of nodes $N$ is evenly divided among
the seven nights. The numerator term in Thm. 3 is a vector whose components are
some of the partials of $G$ evaluated when $x_k(\zeta_{,0}) = N/7$. This vector is $7(N - 1)$
dimensional, one dimension for each of the 7 components of (the unary vector comprising)
each node in $\hat\eta$. For any particular $\eta'$ and night $i$, the associated partial derivative
is $\sum_k [e^{-x_k(\zeta_{,0})/c} (1 - x_k(\zeta_{,0})/c) \times \partial_{\zeta_{\eta',0;i}}(x_k(\zeta_{,0}))]$, where as usual $\zeta_{\eta',0;i}$ indicates the
$i$’th component of the unary vector $\zeta_{\eta',0}$. Since $\partial_{\zeta_{\eta',0;i}}(x_k(\zeta_{,0})) = \delta_{i,k}$, for any fixed $i$ and
$\eta'$, this sum just equals $e^{-N/7c}(1 - N/7c)$. Since there are $7(N - 1)$ such terms, after
taking the norm we obtain $|e^{-N/7c} [1 - N/7c] \sqrt{7(N - 1)}|$.

13 The $t = 0$ elements of $C^{\mathrm{eff}}_\eta$ are just $\zeta_{\eta,0}$, but the contributions of $\zeta_{,0}$ to $G$ cannot be written as
a sum of a $\zeta_{\eta,0}$ contribution and a $\zeta_{\hat\eta,0}$ contribution.

The denominator term in Thm. 3 is the difference between the gradients of the
global reward and the clamped reward. These differ on only $N - 1$ terms, one term
for that component of each node $\eta' \neq \eta$ corresponding to the night $\eta$ attends. (The
other $6N - 6$ terms are identical in the two partials and therefore cancel.) This yields
$|e^{-N/7c} [1 - N/7c] [1 - e^{1/c}(1 - \frac{7}{N - 7c})]| \sqrt{N - 1}$. Combining with the result of the
previous paragraph, our ratio is $\left| \frac{\sqrt{7}\,(N - 7c)}{(N - 7c)(1 - e^{1/c}) + 7 e^{1/c}} \right|$.
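
The calculation above can be reproduced numerically. The sketch below is an illustration under the same simplifying assumptions as the text (a single instant, all $\alpha_k = 1$, agents evenly divided among seven nights); it evaluates the numerator and denominator norms of Thm. 3 directly from the derivative of $\gamma(x) = x e^{-x/c}$ and compares the result with a closed form obtained here by simplifying their quotient. The function names are hypothetical, and the closed form is this sketch's own algebra rather than a quotation of the original formula.

```python
import math


def dgamma(x: float, c: float) -> float:
    """Derivative of gamma(x) = x * exp(-x / c), i.e. exp(-x/c) * (1 - x/c)."""
    return math.exp(-x / c) * (1.0 - x / c)


def learnability_ratio(n_agents: int, c: float, nights: int = 7) -> float:
    """Ratio of WL to G learnability via Thm. 3 for an evenly divided population."""
    x = n_agents / nights                 # attendance on every night
    full = dgamma(x, c)                   # partial of G w.r.t. any other agent's component
    clamped = dgamma(x - 1.0, c)          # same partial on eta's night once eta is clamped
    # Numerator: 7(N-1) identical partials of G over the other agents' components.
    numerator = abs(full) * math.sqrt(nights * (n_agents - 1))
    # Denominator: only the N-1 components on eta's night differ after clamping.
    denominator = abs(full - clamped) * math.sqrt(n_agents - 1)
    return numerator / denominator


def closed_form(n_agents: int, c: float) -> float:
    """Algebraic simplification of the same ratio for seven nights (derived in this sketch)."""
    e = math.exp(1.0 / c)
    m = n_agents - 7 * c
    return abs(math.sqrt(7) * m / (m * (1.0 - e) + 7.0 * e))


ratio = learnability_ratio(168, 6.0)
```

For the experimental parameters $N = 168$ and $c = 6$ the two functions agree, and the ratio is well above 1, consistent with the WL reward being the more learnable of the two.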

In addition to this learnability advantage of the WL reward, to evaluate its WL
reward each agent only needs to know the total attendance on the night it attended,
so no centralized communication is required. Finally, although the system won’t be
perfectly factored for this reward (since in fact the effect set of $\eta$’s action at $t$ would be
expected to extend a bit beyond $\zeta_{\eta,t}$), one might expect that it is close enough to being
factored to result in large world utility.


Each agent keeps a seven dimensional Euclidean vector representing its estimate of 
the reward for attending each night of the week. At the end of each week, the component 
of this vector corresponding to the night just attended is proportionally adjusted towards 
the actual reward just received. At the beginning of the succeeding week, the agent picks 
the night to attend using a Boltzmann distribution with energies given by the components 
of the vector of estimated rewards, where the temperature in the Boltzmann distribution 
decays in time. (This learning algorithm is equivalent to Claus and Boutilier’s [56] 
independent learner algorithm for multi-agent reinforcement learning.) We used the 
same parameters (learning rate, Boltzmann temperature, decay rates, etc.) for all three 
reward functions. (This is an extremely primitive RL algorithm which we only chose 
for its pedagogical value; more sophisticated RL algorithms are crucial for eliciting high 
intelligence levels when one is confronted with more complicated learning problems.) 
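
The learning rule just described can be sketched as follows. This is a minimal illustration, not the experimental code: the learning rate, the geometric cooling schedule, and all names are hypothetical, since the text does not give parameter values. It also assumes that the Boltzmann "energy" of a night is the negative of its estimated reward, so that nights with higher estimates are picked more often.

```python
import math
import random
from typing import List, Optional


class BarAgent:
    """One microlearner: a reward estimate per night, sampled with a Boltzmann rule."""

    def __init__(self, nights: int = 7, learning_rate: float = 0.1,
                 rng: Optional[random.Random] = None) -> None:
        self.estimates: List[float] = [0.0] * nights
        self.learning_rate = learning_rate      # hypothetical value
        self.rng = rng if rng is not None else random.Random()

    def pick(self, temperature: float) -> int:
        """Sample a night with probability proportional to exp(estimate / temperature)."""
        top = max(self.estimates)               # subtract the max for numerical stability
        weights = [math.exp((e - top) / temperature) for e in self.estimates]
        return self.rng.choices(range(len(weights)), weights=weights)[0]

    def update(self, night: int, reward: float) -> None:
        """Move the estimate for the attended night proportionally toward the reward."""
        self.estimates[night] += self.learning_rate * (reward - self.estimates[night])


def temperature_at(week: int, t0: float = 1.0, decay: float = 0.99) -> float:
    """Hypothetical geometric cooling schedule for the Boltzmann temperature."""
    return t0 * decay ** week


agent = BarAgent(rng=random.Random(0))
night = agent.pick(temperature_at(0))   # week 0: all estimates equal, so the pick is uniform
agent.update(night, reward=1.0)         # the reward would come from UD, G, or WL
```

Because only the reward signal changes between the UD, G, and WL experiments, the same agent class could be driven by any of the three reward functions.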
Figure 1: Average world reward when $\alpha$ = [0 0 0 7 0 0 0] (left) and when
$\alpha$ = [1 1 1 1 1 1 1] (right). In both plots the top curve is WL, middle is G, and
bottom is UD.


Figure 1 presents world reward values as a function of time, averaged over 50 separate
runs, for all three reward functions, for both $\alpha$ = [1 1 1 1 1 1 1] and $\alpha$ = [0 0 0 7 0 0 0].
The behavior with the G reward eventually converges to the global optimum. This is
in agreement with the results obtained by Crites [57] for the bank of elevators control
problem. Systems using the WL reward also converged to optimal performance. This
indicates that for the bar problem our approximations of effect sets are sufficiently
accurate, i.e., that ignoring the effects one agent’s actions will have on future actions of
other agents does not significantly diminish performance. This reflects the fact that the
only interactions between agents occur indirectly, via their affecting each other’s reward
values.

However, since the WL reward is more learnable than the G reward, convergence
with the WL reward should be far quicker than with the G reward. Indeed, when
$\alpha$ = [0 0 0 7 0 0 0], systems using the G reward converge in 1250 weeks, which is 5
times worse than the systems using WL reward. When $\alpha$ = [1 1 1 1 1 1 1] systems take
6500 weeks to converge with the G reward, which is more than 30 times worse than the
time with the WL reward.


In contrast to the behavior for COIN theory-based reward functions, use of the conventional
UD reward results in very poor world reward values, values that deteriorated
as the learning progressed. This is an instance of the TOC. For example, for the case
where $\alpha$ = [0 0 0 7 0 0 0], it is in every agent’s interest to attend the same night, but
their doing so shrinks the world reward “pie” that must be divided among all agents.
A similar TOC occurs when $\alpha$ is uniform. This is illustrated in Fig. 2, which shows a
typical example of $\{x_k(\zeta_{,t})\}$ for each of the three reward functions for $t = 2000$. In
this example optimal performance (achieved with the WL reward) has 6 agents each on
6 separate nights, and the remaining 132 agents on one night.
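
To see why this lopsided attendance profile scores well, the weekly world reward can be evaluated directly for uniform $\alpha$ and $c = 6$. The sketch below is a worked check under those stated parameters; the comparison profile (all 168 agents spread evenly) is chosen here only for contrast, and the function name is illustrative.

```python
import math
from typing import Sequence


def world_reward(attendance: Sequence[float], alphas: Sequence[float], c: float) -> float:
    """Weekly world reward R = sum_k alpha_k * x_k * exp(-x_k / c)."""
    return sum(a * x * math.exp(-x / c) for x, a in zip(attendance, alphas))


ALPHAS = [1.0] * 7
C = 6.0
concentrated = [6, 6, 6, 6, 6, 6, 132]   # profile reported as optimal in the text
even_split = [24] * 7                    # contrast case: 168 agents spread evenly

r_concentrated = world_reward(concentrated, ALPHAS, C)
r_even = world_reward(even_split, ALPHAS, C)
```

Six nights each sit at the peak attendance $c$, while the seventh night absorbs the surplus agents at almost no cost, so `r_concentrated` exceeds `r_even`.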
Figure 2: Typical daily attendance when $\alpha$ = [1 1 1 1 1 1 1] for WL (left), G (center),
and UD (right).


Figure 3 shows how t = 2000 performance scales with N for each of the reward signals 
for a = [0 0 0 7 0 0 0]. Systems using the UD reward perform poorly regardless of N. 
Systems using the G reward perform well when N is low. As N increases however, it 
becomes increasingly difficult for the agents to extract the information they need from the 
G reward. (This problem is significantly worse for uniform a.) Because of their superior
learnability, systems using the WL reward overcome this signal-to-noise problem (i.e.,
because the WL reward is based on the difference between the actual state and the state
where one agent is clamped, it is much less affected by the total number of agents).

Figure 3: Behavior of each reward function with respect to the number of agents for
a = [0 0 0 7 0 0 0].
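
The signal-to-noise argument can be illustrated with a minimal sketch. It assumes a
weekly world reward of the form sum over nights k of a_k x_k exp(-x_k / c) with c = 6,
and it assumes that clamping an agent means removing it from the night it attends; both
are assumptions made for this illustration, and the function names are hypothetical.
Under those assumptions an agent's WL reward depends only on the attendance of its own
night, whereas the world reward also moves with what every other agent does.

```python
import math

def world_reward(attendance, alpha, c=6.0):
    """Assumed weekly world reward: sum_k alpha_k * x_k * exp(-x_k / c)."""
    return sum(a * x * math.exp(-x / c) for a, x in zip(alpha, attendance))

def wl_reward(attendance, night, alpha, c=6.0):
    """WL reward of one agent attending `night`: actual world reward minus
    the world reward with that agent clamped to attending no night."""
    clamped = list(attendance)
    clamped[night] -= 1  # remove the agent from its night
    return world_reward(attendance, alpha, c) - world_reward(clamped, alpha, c)

alpha = [1, 1, 1, 1, 1, 1, 1]
base = [6, 6, 6, 6, 6, 6, 132]     # attendance per night
shifted = [6, 40, 6, 6, 6, 6, 98]  # other agents rearranged; night 0 unchanged

# The world reward changes with the other agents' choices ...
g_base = world_reward(base, alpha)
g_shifted = world_reward(shifted, alpha)

# ... but the WL reward of an agent on night 0 does not.
wl_base = wl_reward(base, 0, alpha)
wl_shifted = wl_reward(shifted, 0, alpha)
```

In this sketch the terms for all nights other than the agent's own cancel in the
difference, which is why the WL signal is insensitive to the total number of agents.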


4.5 Macrolearning 

In the experiments recounted above the agents were sufficiently independent that assuming
they did not affect each other’s actions (when forming guesses for effect sets) allowed
the resultant WL reward signals to result in optimal performance. In this section we
investigate the contrasting situation where we have initial guesses of effect sets that are
quite poor and that therefore result in bad global performance when used with WL rewards.
In particular, we investigate the use of macrolearning to correct those guessed
effect sets at run-time, so that with the corrected guessed effect sets WL rewards will 
instead give optimal performance. This models real-world scenarios where the system 
designer’s initial guessed effect sets are poor approximations of the actual associated 
effect sets and need to be corrected adaptively. 

In these experiments the bar problem is significantly modified to incorporate con- 
straints designed to result in poor G when the WL reward is used with certain initial 
guessed effect sets. To do this we forced the nights actually attended by some of the 
agents (followers) to agree with those attended by other agents (leaders), regardless 
of what night those followers “picked” via their microlearning algorithms. (For leaders, 
picked and actually attended nights were always the same.) We then had the world utility 
be the sum, over all leaders, of the values of a triply-indexed reward matrix whose indices 
are the nights that each leader-follower set attends: $G(\zeta) = \sum_t \sum_i R_{l_i(t), f1_i(t), f2_i(t)}$,
where $l_i(t)$ is the night the $i$th leader attends in week $t$, and $f1_i(t)$ and $f2_i(t)$ are the
nights attended by the followers of leader $i$ in week $t$ (in this study, each leader has two
followers). We also had the states of each node be one of the integers {0, 1, ..., 6} rather
than (as in the bar problem) a unary seven-dimensional vector. This was a bit of a 
contrivance, since constructions like dr aren’t meaningful for such essentially symbolic 
interpretations of the possible states C Q . But as elaborated below, it was helpful for 
constructing a scenario in which guessed effect set WLU results in poor performance,
i.e., a scenario in which we can explore the application of macrolearning.

To see how this setup can result in poor world utility, first note that the system’s
dynamics is what restricts all the members of each triple $(l_i(t), f1_i(t), f2_i(t))$ to equal
the night picked by leader $i$ for week $t$. So $f1_i(t)$ and $f2_i(t)$ are both in leader $i$’s actual
effect set at week $t$, whereas the initial guess for $i$’s effect set may or may not contain
nodes other than $l_i(t)$. (E.g., in the bar problem experiments, it does not contain any
nodes beyond $l_i(t)$.) On the other hand, $G$ and $R$ are defined for all possible triples
$(l_i(t), f1_i(t), f2_i(t))$. So in particular, $R$ is defined for the dynamically unrealizable
triples that can arise in the clamping operation. This fact, combined with the leader-follower
dynamics, means that for certain $R$’s there exist guessed effect sets such that
the dynamics assures poor world utility when the associated WL rewards are used. This
is precisely the type of problem that macrolearning is designed to correct.
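
The leader-follower dynamics and the world utility can be sketched as follows. This is
only an illustration: the function names and the particular reward matrix are hypothetical
and not taken from the experiments. It shows that followers' picks are overridden by their
leader's pick, while $R$ remains defined for triples the dynamics can never produce.

```python
NIGHTS = 7  # node states are the integers 0..6

def attended(picks):
    """Apply the leader-follower dynamics to one week of picks.

    picks: list of (leader_pick, follower1_pick, follower2_pick), one per leader.
    Followers' picks are ignored; they attend their leader's night.
    """
    return [(l, l, l) for (l, _f1, _f2) in picks]

def world_utility(history, R):
    """G = sum over weeks t and leaders i of R[l_i(t)][f1_i(t)][f2_i(t)]."""
    return sum(R[l][f1][f2] for week in history for (l, f1, f2) in week)

# Hypothetical reward matrix, defined for every triple (realizable or not):
# 1 when the whole triple is on night 3, 0.5 for any other triple.
R = [[[1.0 if (a, b, c) == (3, 3, 3) else 0.5 for c in range(NIGHTS)]
      for b in range(NIGHTS)] for a in range(NIGHTS)]

picks = [
    [(3, 0, 6), (5, 5, 1)],  # week 0: two leaders, each with two followers
    [(3, 3, 3), (3, 2, 4)],  # week 1
]
history = [attended(week) for week in picks]
G = world_utility(history, R)
```

Only triples of the form (n, n, n) ever occur in `history`, yet an entry such as
`R[3][0][6]` still exists and is what a clamping operation may end up evaluating.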

As an example, say each week only contains two nights, 0 and 1. Set $R_{111} = 1$ and
$R_{000} = 0$. So the contribution to $G$ when a leader picks night 1 is 1, and when that
leader picks night 0 it is 0, independent of the picks of that leader’s followers (since the
actual nights they attend are determined by their leader’s picks). Accordingly, we want
to have a private utility for each leader that will induce that leader to pick night 1. Now
if a leader’s guessed effect set includes both of its followers (in addition to the leader
itself), then clamping all elements in its effect set to 0 results in an $R$ value of $R_{000} = 0$.
Therefore the associated guessed effect set WLU will reward the leader for choosing night
1, which is what we want. (For this case WL reward equals $R_{111} - R_{000} = 1$ if the leader
picks night 1, compared to reward $R_{000} - R_{000} = 0$ for picking night 0.)

However consider having two leaders, $i_1$ and $i_2$, where $i_1$’s guessed effect set consists of
$i_1$ itself together with the two followers of $i_2$ (rather than together with the two followers
of $i_1$ itself). So neither of leader $i_1$’s followers are in its guessed effect set, while $i_1$ itself is.
Accordingly, the three indices to $i_1$’s $R$ need not have the same value. Similarly, clamping
the nodes in its guessed effect set won’t affect the values of the second and third indices
to $i_1$’s $R$, since the values of those indices are set by $i_1$’s followers. So for example, if $i_2$
and its two followers go to night 0 in week 0, and $i_1$ and its two followers go to night 1 in
that week, then the associated guessed effect set wonderful life reward for $i_1$ for week 0 is

$$G(\zeta_{,t=0}) - G(\mathrm{CL}_{l_{i_1}(0), f1_{i_2}(0), f2_{i_2}(0)}(\zeta_{,t=0})) = R_{l_{i_1}(0), f1_{i_1}(0), f2_{i_1}(0)} + R_{l_{i_2}(0), f1_{i_2}(0), f2_{i_2}(0)} - [R_{0, f1_{i_1}(0), f2_{i_1}(0)} + R_{l_{i_2}(0), 0, 0}].$$

This equals $R_{111} + R_{000} - R_{011} - R_{000} = 1 - R_{011}$. Simply
by setting $R_{011} > 1$ we can ensure that this is negative. Conversely, if leader $i_1$ had
gone to night 0, its guessed effect set WLU would have been 0. So in this situation leader $i_1$
will get a greater reward for going to night 0 than for going to night 1. In this situation,
leader $i_1$’s using its guessed effect set WLU will lead it to make the wrong pick.
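
The two-night example can be checked numerically. The sketch below is an illustration
under stated assumptions: the node names are hypothetical, the clamping value is night 0
as in the text, and $R_{011} = 2$ is an arbitrary choice that makes $1 - R_{011}$ negative.
It compares the WL reward of leader $i_1$ under a correct guessed effect set and under the
mistaken one that contains the followers of $i_2$.

```python
# (leader, follower 1, follower 2) node names for the two leader-follower sets.
TRIPLES = [("l1", "f1_1", "f2_1"), ("l2", "f1_2", "f2_2")]

def week_reward(nights, R):
    """One week's contribution to G: sum of R over the two triples."""
    return sum(R[(nights[l], nights[a], nights[b])] for l, a, b in TRIPLES)

def wl_reward(nights, guessed_effect_set, R, clamp_to=0):
    """Actual reward minus reward with the guessed effect set clamped."""
    clamped = dict(nights)
    for node in guessed_effect_set:
        clamped[node] = clamp_to
    return week_reward(nights, R) - week_reward(clamped, R)

def attend(l1_pick, l2_pick):
    """Leader-follower dynamics: followers attend their leader's night."""
    return {"l1": l1_pick, "f1_1": l1_pick, "f2_1": l1_pick,
            "l2": l2_pick, "f1_2": l2_pick, "f2_2": l2_pick}

# Only the entries the example touches; R_011 = 2 is an illustrative choice.
R = {(1, 1, 1): 1.0, (0, 0, 0): 0.0, (0, 1, 1): 2.0}

good_set = {"l1", "f1_1", "f2_1"}  # leader 1 with its own followers
bad_set = {"l1", "f1_2", "f2_2"}   # leader 1 with leader 2's followers

good_night1 = wl_reward(attend(1, 0), good_set, R)  # R_111 - R_000
good_night0 = wl_reward(attend(0, 0), good_set, R)  # R_000 - R_000
bad_night1 = wl_reward(attend(1, 0), bad_set, R)    # 1 - R_011
bad_night0 = wl_reward(attend(0, 0), bad_set, R)    # 0
```

With the correct effect set, night 1 is rewarded more than night 0; with the mistaken
one the ordering flips, so the leader is steered toward the pick that lowers $G$.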

To investigate the efficacy of the macrolearning, two sets of separate experiments 
were conducted. In the first one the reward matrix R was chosen so that if each leader is 
maximizing its WL reward, but for guessed effect sets that contain none of its followers, 
then the system evolves to minimal world reward. So if a leader incorrectly guesses that
some set is its effect set even though that set doesn’t contain both of that leader’s followers,
and if this is true for all leaders, then we are assured of worst possible performance.
In the second set of experiments, we investigated the efficacy of macrolearning for a 
broader spectrum of reward matrices by generating those matrices randomly. We call 
these two kinds of reward matrices worst-case and random reward matrices, respectively. 
In both cases, if it can modify the initial guessed effect sets of the leaders to include their 
followers, then macrolearning will induce the system to be factored. 

The microlearning in these experiments was the same as in the bar problem. All 
experiments used the WL personal reward with some (initially random) guessed effect 
set. When macrolearning was used, it was implemented starting after the microlearning 
had run for a specified number of weeks. The macrolearner worked by estimating the 
correlations between the agents’ selections of which nights to attend. It did this by examining
the attendances of the agents over the preceding weeks. Given those estimates, for
each agent $\eta$ the two agents whose attendances were estimated to be the most correlated
with those of agent $\eta$ were put into agent $\eta$’s guessed effect set. Of course, none of this
macrolearning had any effect on global performance when applied to follower agents, but 
the macrolearning algorithm cannot know that ahead of time; it applied this procedure 
to each and every agent in the system. 
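
The macrolearning step described above can be sketched roughly as follows. The text does
not say which correlation estimate was used, so this sketch substitutes a simple stand-in:
the fraction of preceding weeks in which two agents attended the same night. That measure,
the function names, and the attendance data are all assumptions made for illustration.

```python
def agreement(a, b):
    """Stand-in correlation: fraction of weeks two agents shared a night."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def macrolearn(attendance, k=2):
    """Rebuild every agent's guessed effect set from attendance histories.

    attendance: dict mapping agent name -> list of nights attended.
    Returns a dict mapping each agent to the set containing itself and the
    k other agents whose attendance agrees with its own most often.
    """
    effect_sets = {}
    for agent, history in attendance.items():
        others = [o for o in attendance if o != agent]
        others.sort(key=lambda o: agreement(history, attendance[o]), reverse=True)
        effect_sets[agent] = {agent, *others[:k]}
    return effect_sets

# Hypothetical attendance over six weeks for two leader-follower sets.
attendance = {
    "l1":   [0, 3, 5, 2, 6, 1],
    "f1_1": [0, 3, 5, 2, 6, 1],
    "f2_1": [0, 3, 5, 2, 6, 1],
    "l2":   [4, 3, 1, 2, 0, 5],
    "f1_2": [4, 3, 1, 2, 0, 5],
    "f2_2": [4, 3, 1, 2, 0, 5],
}
effect_sets = macrolearn(attendance)
```

As in the text, the procedure is applied to every agent, followers included, even though
only the leaders' corrected effect sets can change global performance.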

Figure 4 presents averages over 50 runs of world reward as a function of weeks using
the worst-case reward matrix. For comparison purposes, in both plots the top curve 
represents the case where the followers are in their leader’s guessed effect sets. The 
bottom curve in both plots represents the other extreme where no leader’s guessed effect 
set contains either of its followers. In both plots, the middle curve is performance when 
the leaders' guessed effect sets are initially random, both with (right) and without (left) 
macrolearning turned on at week 500. 

The performance for random guessed effect sets differs only slightly from that of having
leaders’ guessed effect sets contain none of their followers; both start with poor values
of world reward that deteriorate with time. However, when macrolearning is performed
on systems with initially random guessed effect sets, the system quickly rectifies itself
and converges to optimal performance. This is reflected by the sudden vertical jump
through the middle of the right plot at 500 weeks, the point at which macrolearning
changed the guessed effect sets. By changing those guessed effect sets macrolearning
results in a system that is factored for the associated WL reward function, so that those
reward functions quickly induce the maximal possible world reward.

Figure 5 presents performance averaged over 50 runs for world reward as a function of 
weeks using a spectrum of reward matrices selected at random. The ordering of the plots 
is exactly as in Figure 4. Macrolearning is applied at 2000 weeks, in the right plot. The 
simulations in Figure 5 were lengthened from those in Figure 4 because the convergence 
time of the full spectrum of reward matrices case was longer. 

In figure 5 the macrolearning resulted in a transient degradation in performance at
2000 weeks followed by convergence to the optimal. Without macrolearning the system’s
performance no longer varied after 2000 weeks. Combined with the results presented in
Figure 4, these experiments demonstrate that macrolearning induces optimal performance
by aligning the agents’ guessed effect sets with those agents that they actually do
influence the most.

Figure 4: Leader-follower problem with worst-case reward matrix. In both plots, every
follower is in its leader’s guessed effect set in the top curve, no follower is in its leader’s
guessed effect set in the bottom curve, and followers are randomly assigned to guessed
effect sets of the leaders in the middle curve. The two plots are without (left) and with
(right) macrolearning at 500 weeks.


5 CONCLUSION 

Many distributed computational tasks cannot be addressed by direct modeling of the 
underlying dynamics, or are at best poorly addressed that way due to robustness and 
scalability concerns. Such tasks should instead be addressed by model-independent ma- 
chine learning techniques. In particular, Reinforcement Learning (RL) techniques are 
often a natural choice for how to address such tasks. When — as is often the case — we 
cannot rely on centralized control and communication, such RL algorithms have to be 
deployed locally, throughout the system. 

This raises the important and profound question of how to configure those algorithms, 
and especially their associated utility functions, so as to achieve the (global) computa- 
tional task. In particular we must ensure that the RL algorithms do not “work at 
cross-purposes” as far as the global task is concerned, lest phenomena like tragedy of the 
commons occur. How to initialize a system to do this is an entirely novel kind of
inverse problem, and how to adapt a system at run-time to better achieve such a global
task is an entirely novel kind of learning problem. We call any distributed computational
system analyzed from the perspective of such an inverse problem and/or on-line learning
problem a Collective INtelligence (COIN).

Figure 5: Leader-follower problem for random reward matrices. The ordering of the
plots is exactly as in Figure 4. Macrolearning is applied at 2000 weeks, in the right plot.

As discussed in the literature review section of this chapter, there are many approaches/fields
that address aspects of COINs. These range from multi-agent systems
through conventional economics and on to computational economics. (Human economies
are a canonical model of a functional COIN.) They range onward to game theory, various
aspects of distributed biological systems, and on through physics, active walker models, 
and recurrent neural nets. Unfortunately, none of these fields seems appropriate as a 
general approach to understanding COINs. 

After this literature review we present a mathematical theory for COINs. We then 
present experiments on two test problems that validate the predictions of that theory 
for how best to design a COIN to achieve a global computational task. The first set of 
experiments involves a variant of Arthur’s famous El Farol Bar problem. The second set 
instead considers a leader-follower problem that was hand-designed to cause maximal
difficulty for the advice of our theory on how to initialize a COIN. This second set of 
experiments was therefore a test of the on-line learning aspect of our approach to COINs. 
In both experiments the procedures derived from our theory, procedures using only local 
information, vastly outperformed natural alternative approaches, even such approaches 
that exploited global information. Indeed, in both problems, following the theory sum- 
marized in this chapter provides good solutions even when the exact conditions required 
by the associated theorems hold only approximately. 

There are many directions in which future work on COINs will proceed; it is a vast and 
rich area of research. We are already successfully applying our current understanding of 
COINs, tentative as it is, to internet packet routing problems. We are also investigating 
COINs in a more general optimization context where economics-inspired market mech- 
anisms are used to guide some of the interactions among the agents of the distributed 
system. The goal in this second body of work is to parallelize and solve numerical optimization
problems where the concept of an “agent” may not be in the natural definition
of the problem. We also intend to try to apply our current COIN framework to the prob- 
lem of designing high-occupancy toll lanes in vehicular traffic, and to help understand 
the “design space” necessary for distributed biochemical entities like pre-genomic cells.